SETI @ UNC Computer Science

Welcome to our page about how we contribute to the SETI@Home project. We're not running it at home--- we're running it on a machine in the Computer Science Department at UNC that has spare cycles (mostly at night). On this page, we explain what the hardware is, how we run it all automatically so that anyone else with work to do gets priority (even a runaway emacs process), who we are, why our average time to finish a work unit is lower than most, and some musings on the accumulated stats at the SETI@Home site.

Hardware

Our primary hardware is an SGI Onyx2 (aka an Origin2000 with graphics accelerators). It has 32 processors, each of which is a MIPS R12000 running at 300 MHz (read below for reasons why a 300, or even 180, MHz processor can outperform a 500 MHz Pentium III). We run the Irix6.2 binary compiled with the mips3 instruction set. For a little while, we ran the Irix6.4-mips4 binary, which ran more than twice as fast, but produced unusable results.

We also contribute to SETI@Home with several other machines--- a few Sun SPARCs, a Macintosh, a few PCs, and some HP workstations. Not one of these is nearly as fast as a single processor on the Onyx2, so together they do not even add up to 5% of our total contribution.

Who

Our group is composed of several people from the Computer Science department, and some that we've never heard of (who are you guys?). Two of us study parallel applications for a living, and have a side interest in the performance of the SETI@Home application.

People Involved
Lars Nyland http://www.cs.unc.edu/~nyland/
Jan Prins http://www.cs.unc.edu/~prins/
Others Who are you?

Why is the R12k processor so much faster than a PentiumIII?

Most people assume that the clock rate of a processor is the most important factor in determining its computing power. The clock rate, expressed as millions of clock ticks per second, or megahertz, is only one indicator of a processor's computing power. It is simply the rate at which the processor can start doing the "next" thing, whatever that means for a particular family of processors (e.g. Pentium II, MIPS R12000, Alpha, PowerPC, etc.). The next question to ask after you know the clock rate is "what can the processor start doing each time?"

Pipelining

Most modern processors are pipelined. What this means is that several instructions can be processed concurrently, with each instruction at a different phase (all are "in-flight"). An example of a long, non-computer pipeline is an automobile assembly line. A new car may roll off the assembly line every 30 seconds, but any individual car takes days to make it from start to finish. This leads to 2 quantities: the restart rate, and the latency. The restart rate states how soon another instruction of the same kind can be started (or a new car begun), usually in terms of clock ticks. The latency measures how long it takes to complete the execution of an instruction (the time to build a complete car from start to finish). Each instruction has its own latency and restart rate that must be considered when determining performance.

On aggressive processor implementations, the restart rate is 1. Most processors do not achieve this for all instructions, especially complicated floating-point instructions like divide and square root. The latency is usually some small constant, like 3, 5 or 12, representing common stages of instructions (decode, read, compute, store, branch).
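
The difference between the two quantities is easiest to see with a small back-of-envelope calculation. The sketch below is our own illustration (not part of the SETI@Home code): it counts cycles for a run of FP adds with an assumed latency of 3 and restart rate of 1, first when each add depends on the previous result, then when the adds are independent and the pipeline stays full.

  #!/usr/bin/perl -w
  use strict;

  my $n       = 1000;   # number of FADD instructions (made-up workload)
  my $latency = 3;      # cycles from issue to result
  my $restart = 1;      # cycles before the next FADD can start

  my $dependent   = $n * $latency;                  # each add waits for the previous result
  my $independent = ($n - 1) * $restart + $latency; # latency is paid only once

  printf "dependent chain:    %d cycles\n", $dependent;
  printf "independent stream: %d cycles\n", $independent;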

Instruction Reordering

Most modern processors have an instruction pool from which they select "ready" instructions. An instruction is ready when all of its operands are available. Instructions are decoded and put into the pool, and once executed, they are handed to a reordering device that restores the order so that your program appears to be executed in the order it was written.

Multiple-issue Floating Point Hardware

Most modern processors have several units that can each execute ready instructions. For instance, a Pentium III has 2 ALUs and a floating-point (FP) unit that can all execute instructions simultaneously. This feature is also referred to as superscalar architecture.

Cache

In order to bridge the gap between the fast CPU clock rate and the much lower speed that memory values can be transferred to the CPU, all modern CPUs have "cache" memory. Usually it comes in 2 flavors--- the fastest is "Level I" and is on the CPU chip itself with a typical size of 8K to 64K; Level II is usually off-chip, substantially larger and a bit slower, with capacities between 512K and 8M.

Cache is usually N-way associative, which means that there are N different places that can be used to store each real memory location. In a fully-associative cache, any memory address can be located anywhere (this is the ideal case). Most caches are 2- or 4-way associative. The speed gain is substantial when desired memory values are cached. If a value is in L1 cache, it typically can be used by the CPU just 1 clock tick after it has been requested. With prefetching, the value can be in the CPU when it is needed, with no delay. If it happens to be in L2, the delay is something like 6--10 ticks. If it is in neither, the delay to get it from main memory is anywhere from 25 ticks to several hundred, depending on the memory architecture.
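
To see how much the hit rate matters, here is a small illustrative calculation. The hit rates are assumptions invented for the example (not measurements of our machines); the delays are the rough figures quoted above.

  #!/usr/bin/perl -w
  use strict;

  # Assumed delays in clock ticks: L1 hit, L2 hit, main-memory access.
  my %delay = (l1 => 1, l2 => 8, mem => 60);

  sub avg_ticks {
      my ($l1_hit, $l2_hit) = @_;          # fractions of all accesses
      my $mem = 1 - $l1_hit - $l2_hit;     # the rest go to main memory
      return $l1_hit * $delay{l1} + $l2_hit * $delay{l2} + $mem * $delay{mem};
  }

  printf "small L2 (some misses): %.1f ticks/access\n", avg_ticks(0.90, 0.06);
  printf "large L2 (data fits):   %.1f ticks/access\n", avg_ticks(0.90, 0.099);

Even a miss rate of a few percent to main memory is enough to roughly double the average access time in this example.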

Pentium vs. MIPS

If you dig into the Pentium and R12000 architecture manuals, you'll find out why the MIPS R12000 processor can run SETI@Home faster, despite a slower clock rate.

On the Pentium III, an FADD (FP-add) can be started every clock cycle, and has a latency of 3 cycles. Just as you would hope. The FMUL instruction has a latency of 5 cycles, but can only be issued every other clock cycle. And finally, the FDIV instruction, in single-precision, has a latency and reissue rate of 18 cycles (32 for double, and 38 for extended). This instruction is not pipelined (thus the large reissue rate), which means that no other floating-point operation can be in the pipeline. The Pentium III has only one FPU.

On the MIPS R12000, there are 2 FPUs, 2 ALUs, and a load/store unit, allowing up to 5 instructions to be issued per clock cycle. It can speculatively execute up to 4 branches deep, and has instruction reordering for better CPU utilization. It also has 64 integer and 64 FP registers that are dynamically assigned from the non-busy pool. There is a branch prediction unit with a branch prediction table (assoc. mem) that holds 2048 branch-target addresses.

There is an FP-add unit, an FP-multiply unit and an FP-divide/square root unit (I see 3; how do they count 2?). The add and multiply units can execute instructions with a 2-cycle latency at a 1-cycle reissue rate. The FDIV instruction is not pipelined, and has a latency of 12 cycles with a repeat rate of 14 (why the extra 2?).

Finally, the cache size differs drastically between the Pentium and the MIPS R12K. The Pentium II dropped its L2 cache to 512K from the Pentium Pro's larger value of 1M, although the newer PIIIs can have 512K, 1M, or 2M of L2 cache. The MIPS R12K processors we use have an L2 cache size of 8M. The difference this makes is that all of the data can reside in cache on the MIPS CPU, while on the Pentium it must occasionally be fetched from main memory.

If the instruction stream is optimally pipelined (all operands available when the instruction is ready to be executed), and the reissue rate is 1, then the latency is completely hidden--- this is rarely the case. The extra processing hardware, larger cache memory, shorter latency and faster reissue capabilities of the MIPS processor allow it to execute about 3-4 times as many FP adds and multiplies per cycle as a Pentium III. But given the slower clock rate, the speed improvement is closer to a factor of 2 in actual execution speed. Thus, our Pentium IIs (dual Xeons at 400 MHz) complete one WU in about 9 hours, while a single MIPS R12000 (at 300 MHz) does the same work in 4 hours. Our Pentium IIIs running at 600 MHz improve only marginally over the 400 MHz Xeon PIIs, finishing one WU in 8.5 hours.
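
Those two claims can be checked with a back-of-envelope calculation that uses only the issue rates quoted above. It is our own simplification (a stream of alternating adds and multiplies, no dependence stalls, no divides, no cache misses), so treat the numbers as rough.

  #!/usr/bin/perl -w
  use strict;

  # Pentium III: one FPU, an FADD every cycle, an FMUL every other cycle,
  #   so an add+multiply pair takes about 3 cycles => 2/3 FP op per cycle.
  # R12000: separate add and multiply units, each with a 1-cycle reissue
  #   rate => 2 FP ops per cycle.
  my $piii_per_cycle = 2 / 3;
  my $r12k_per_cycle = 2;

  my $per_cycle_ratio = $r12k_per_cycle / $piii_per_cycle;
  my $clock_ratio     = 300 / 500;     # 300 MHz R12000 vs 500 MHz Pentium III

  printf "per-cycle advantage: %.1fx\n", $per_cycle_ratio;                # about 3x
  printf "overall advantage:   %.1fx\n", $per_cycle_ratio * $clock_ratio; # about 1.8x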

SETI@Home Accumulated Stats

Continuous contributors to SETI@Home run the application as much as possible on as many machines as they can get their hands on without annoying anyone. This leads to a roughly constant contribution (over time) of work units completed. Our contribution is typically 100+ WUs per night during the week, with 200-300 over the weekend.

All the SETI@Home contributors started contributing at different times, but as time passes, we notice a stabilization in the top-100 statistics. This is because, over time, the slope of the line representing a contributor's accumulated work units matters more than the date when they started.

Each contributor can roughly compute their accomplishment on a particular date with an equation such as:

y = mx - b

where m is the contribution rate (say, work units per week), x is the number of weeks since the SETI@Home program became publicly available, and b is the number of work units you could have run between the time SETI@Home started and when you found out about it (this also equals m * weeks missed).

In our case, m might be about 800 WU/week, while b might be as large as 10000.
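
Plugging in rough numbers, a short script like the one below (an illustration, not something we actually run) shows how the totals evolve, and when a latecomer with a higher rate overtakes an earlier starter. The latecomer's numbers are invented for the example.

  #!/usr/bin/perl -w
  use strict;

  # y = m*x - b : cumulative work units after x weeks of public availability.
  sub total { my ($m, $b, $x) = @_; return $m * $x - $b; }

  my ($m_us,   $b_us)   = (800, 10000);    # our rough numbers from above
  my ($m_late, $b_late) = (1200, 30000);   # a made-up latecomer with a higher rate

  foreach my $week (30, 52, 104) {
      printf "week %3d: us %6d WU, latecomer %6d WU\n",
             $week, total($m_us, $b_us, $week), total($m_late, $b_late, $week);
  }

  # Crossover where m1*x - b1 = m2*x - b2, i.e. x = (b2 - b1) / (m2 - m1).
  printf "latecomer passes us at week %.0f\n",
         ($b_late - $b_us) / ($m_late - $m_us);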

We all started at different times and the differing b values were the primary indicator of our relative ranks. Now, however, the slope m dominates, and all participants slowly diverge while maintaining their relative status. Every so often, a new group or user will start up and have a large value of b due to their late entry, but may also have a large value of m. As they contribute at their rate of m, they will eventually pass all other contributors with lower values of m (given enough time). They will always be parallel to other contributors with equal values of m, and will never catch contributors with larger values of m. Thus the rankings of steadily contributing members will eventually stabilize. The picture below is meant to show a graphical version of the words above. You can see that once the variations in the starting times work themselves out, all the contributors will maintain their same ranking (and only grow further and further apart).

Running SETI@Home without annoying others

Overview

Our Onyx2's primary goal in life is to run interactive 3D graphics, not scientific computational hog jobs, and certainly not computations for others. Of prime importance is keeping the graphics guys from getting annoyed with computational hogs. So how do we keep them happy and still manage to get 100+ WUs done per night?

I wrote a Perl script that pokes around from time to time, looking for an idle processor. During the day (9-5), it thinks there is only 1 cpu available, so if the load average is above 0.00, there are no SETI@Home jobs. In the evening, the number of cpus available goes up to 16 or 24, and from midnight to 8am, it tries to use all 32. But it doesn't take all of them, just the ones it thinks are unused. Here's how it figures out how many Seti@Home processes to run.

If N is the target number of cpus, and LA is the current load average (1 minute), then the number of SETI@Home jobs, S, to run is:

S = N - ceiling(LA)

In the evening, N = 24 and the load average might be 3.24. Thus, we try to run S = 24 - ceiling(3.24), which is 20. By using the 1-minute load average, we quickly kill jobs when some user starts his work. People have reported that this is very responsive when they are running parallel 'make's or time-consuming renderers. This script has been quite stable now for several months; the only thing I have to do is restart it after the Onyx2 is rebooted.
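
In Perl, the core of that calculation is only a few lines (this is a simplified fragment of the idea, not the nice_seti script itself):

  #!/usr/bin/perl -w
  use strict;
  use POSIX qw(ceil);

  my $N  = 24;      # target number of cpus for this hour (evening value)
  my $la = 3.24;    # current 1-minute load average

  my $S = $N - ceil($la);   # number of Seti@Home jobs to allow
  $S = 0 if $S < 0;         # never a negative count

  print "run $S Seti\@Home jobs\n";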

Details

This section lists all the details, and gives a link to the Perl script, nice_seti. It is very Unix oriented, so the unix-wary should probably go on to something else. Even though Perl runs just fine under Windows and MacOS, nice_seti relies on the existence of particular commands (some available only on Irix), so it won't do well elsewhere.

The script expects a certain file structure, which looks a little different than you might expect. In our Seti@Home directory, we have sub-directories from xy00 to xyPP, where PP = #processors-1 on the machine, and 'xy' is the first 2 letters of the machine name running the Seti@Home executable. The name of the Seti@Home executable in the directory xyPP is setiPP, so we can match the output of 'ps' with a particular 'state.txt' file. The script nice_seti sits above all of these in our Seti@Home directory. On our machine, named evans, our seti directory looks like

%  ls -l
total 456
   8 drwxrwxr-x    2 nyland            4096 Nov 26 10:44 ev00/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 21:13 ev01/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 19:20 ev02/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 19:01 ev03/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 19:33 ev04/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 19:04 ev05/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 17:04 ev06/
   8 drwxrwxr-x    2 nyland            4096 Nov 29 19:29 ev07/
...
  16 -rwxrwxr-x    1 nyland            4370 Nov 29 10:10 nice_seti*
 128 -rw-r--r--    1 nyland           13392 Nov 29 22:03 seti.out
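
With that naming scheme, matching each running process to its directory is a matter of pulling the NN suffix out of the 'ps' output. The fragment below is a sketch of the idea (not the script itself); the 'ps -ef' column layout is the usual SysV one and may differ slightly between systems.

  #!/usr/bin/perl -w
  use strict;

  my %running;                 # NN => pid for each seti process we find
  foreach my $line (`ps -ef`) {
      # second column is the pid; the command name carries the NN suffix
      if ($line =~ /^\s*\S+\s+(\d+).*\bseti(\d\d)\b/) {
          $running{$2} = $1;
      }
  }
  foreach my $nn (sort keys %running) {
      print "seti$nn (pid $running{$nn}) -> ev$nn/state.txt\n";
  }
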
The script is based on the idea that for a given hour of the day (0..23), there is an ideal number of processes to run, and the script knows the difference between weekdays and weekend days. The ideal value is reduced by the current load average (rounded up, or ceiling). The basic trip through the loop, then, is to look up the ideal number for the current hour, subtract the rounded-up load average, and start or kill Seti@Home jobs until the running count matches; the paragraphs below describe the pieces involved.

To start Seti@Home processes, we use the Irix 'npri' command and its -w flag that runs processes as "weightless." This makes them equivalent to the idle thread, in that they are only run when there is absolutely nothing else for a CPU to do. The load created by weightless processes is not added to the load average, so the load of the Seti@Home jobs does not need to be removed from the value reported by the 'uptime' command (as it would on other Unix platforms).
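
For example, starting the copy that lives in ev05 by hand would look roughly like this (the actual command line the script builds may differ slightly):

  % cd ev05
  % npri -w ./seti05 &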

There is some disagreement about the impact of jobs run with the 'npri -w' command. The manual and some anecdotal information indicate that these jobs do not appear to have any impact on jobs with a finite 'nice' value. Performance problems on our SGI have been blamed on the existence of these jobs, but a thorough investigation has never been done. Thus, to keep interactive users happy, the ideal number of processors is typically 1 during the day, to keep all blame away from our Seti@Home processes.

The script also depends on the Irix/Unix commands 'hinv' to get the number of processors, 'date' to get the day-name and hour, 'uptime' to get the 1, 5, and 15-minute load averages, and 'uname' to get the name of the system.
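
Gathering those inputs in Perl might look like the fragment below. This is a sketch: the uptime and date invocations are standard, but the pattern used to pick the processor count out of hinv is a guess at its summary line and may need adjusting for a particular Irix release.

  #!/usr/bin/perl -w
  use strict;

  chomp(my $host = `uname -n`);                     # name of the system
  my ($day, $hour) = split ' ', `date '+%a %H'`;    # day-name and hour

  # 1, 5 and 15-minute load averages from the end of the uptime line.
  my ($la1, $la5, $la15) =
      `uptime` =~ /load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)/;

  # Assumed hinv summary line, e.g. "32 300 MHZ IP27 Processors".
  my ($ncpu) = `hinv` =~ /^(\d+)\s.*Processors\s*$/mi;

  print "$host $day $hour: $ncpu cpus, load $la1 $la5 $la15\n";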

There is also a fudge factor so that jobs are not killed for incidental work created by daemons such as nfsd, telnetd, or ftpd. We assume that a steady load average of 0.15 (or k+0.15) or less probably indicates "incidental" work that will go away the next time we ask. Thus we don't really kill any jobs until we exceed the target value by more than 1.15, not 1. This keeps small variations in the load average from constantly killing and restarting our Seti@Home jobs. Real work, such as compilations, interactive graphics, or large rendering jobs, immediately bumps the load average by more than 0.15, so those jobs cause almost instant killing of Seti@Home jobs.
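
One way to encode the fudge factor is to subtract it from the load average before rounding up, so that a steady k+0.15 still counts as k. This is our own sketch of the idea, not the bookkeeping the script actually does.

  #!/usr/bin/perl -w
  use strict;
  use POSIX qw(ceil);

  my $FUDGE = 0.15;    # slack for incidental daemon activity

  sub seti_jobs_allowed {
      my ($target_cpus, $la) = @_;
      my $s = $target_cpus - ceil($la - $FUDGE);
      return $s < 0 ? 0 : $s;
  }

  printf "nfsd blip (load 0.12): %d jobs\n", seti_jobs_allowed(24, 0.12);  # 24, not 23
  printf "real work (load 1.10): %d jobs\n", seti_jobs_allowed(24, 1.10);  # 23, not 22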

One more note: The Perl script nice_seti seems to need the latest, greatest version of Perl. We use the one in /usr/freeware/bin. I think it is not the version of Perl, but the version of some of the support files, such as those that handle process id information.

- Revised: Mon Nov 29 21:30:12 1999 by nyland@cs.unc.edu