Design Project: Universal Emulator

Dean Herington
COMP 265
May 2, 2001

0. Introduction

For my design project I have chosen to design a universal emulator. As directed, I have envisioned the universal emulator as a special coprocessor card attached to a general-purpose host computer via its backplane.

Purpose

The purpose of the universal emulator is to enable the host computer to which it is attached to emulate other computer architectures more effectively. More precisely, the universal emulator should allow its host processor to execute more quickly a range of application programs written for different architectures from that of the host, yet without the need for nontrivial translation or preparation of those applications for the new environment.

Overview

Due to the nature of emulation, an application for an emulation machine consists of two parts: an emulation program that runs on the emulation engine to allow for emulation of a specific computer architecture, and an application program written for that architecture that runs on the emulation program/engine pair.

The emulation machine (also called simply the "emulator") has working storage and memory, much like a general-purpose computer. These resources house the emulation program's code and data, and thereby also the application program's working storage. The code and other data of the application program, however, are stored in host memory. It is assumed that the emulator can efficiently read and write host memory. The emulator also relies on the host for supervisory function such as input/output.

To initiate an emulation, the host downloads an initial state image to the emulator's memory and working storage. Included in this state image is information on the allocation of host memory for use by the application being emulated. After the state image is loaded, execution begins at a predetermined location in emulator memory. The emulator interrupts the host when it requires supervisory service or when it has completed execution of the application.

1. Application Characteristics

A good design for a special-purpose computer must accommodate and exploit the characteristics of its intended applications. What are those characteristics for the universal emulator?

First, it must be noted that, because we desire to emulate general-purpose computer architectures, we can't define very narrowly the characteristics of the application programs that will be run. Fortunately, application programs of widely varying kinds show great similarity among themselves. We can therefore afford to concentrate on the characteristics of the emulation programs that will run on the emulator engine.

The characteristics of the emulation programs flow directly from the architectures they represent. Again fortunately, computer architectures tend to be more similar than different. We can derive the following characteristics for the emulation programs.

Size of Store

Here are some rough estimates of the size of working and control store for some representative machines selected from the Zoo in Blaauw and Brooks:
       section  machine   approx. num. bits

        10.5    Univac I        960
        11.4    IBM 704         160
        12.1    IBM 650         220
        12.4    IBM 360         960
        13.3    IBM Stretch    2100
        14.1    Univac 1103A    140
        14.2    CDC 6600       1056
        14.4    Cray 1        40000
        15.1    PDP 8            70
        15.2    PDP 11          480
        15.3    VAX             768
        16.1    Intel 8080A     100
        16.4    MC68000         640
Most of these machines require well below 200 bytes of working store. The Stretch requires somewhat more, and the Cray 1 much more.

Frequency of Operations

In this section I derive a rough estimate for the frequency of operations for an emulation program running on the emulator engine.

Let us start with the emulation of a conventional modern (RISC) processor. Through a subjective combination of the data from Figures 1.17, 2.11, and 2.26 in Patterson and Hennessy, we arrive at a mix of instruction classes something like this:

        35%     load/store
        50%     alu operations (mostly simple)
        15%     control operations (mostly conditional branch)
Let us now consider the operations involved in emulating a typical instruction from each of these classes.
    load rd,rx,disp
                        load instruction
                        bump instruction address
                        dispatch on opcode
                        look up index register
                        add index and displacement
                        load/store data
                        branch to next cycle

        Adding in 4 field extractions for the 4 parts of the
        instruction, we get:

                2  memory references
                3  alu operations
                2  branches
                4  extractions


    alu rd,rs,rt
                        load instruction
                        bump instruction address
                        dispatch on opcode
                        operate
                        branch to next cycle

        Adding in 4 field extractions for the 4 parts of the
        instruction, we get:

                1  memory reference
                2  alu operations
                2  branches
                4  extractions


    cbr rs,rt,disp
                        load instruction
                        bump instruction address
                        dispatch on opcode
                        operate
                        conditionally update instruction address
                        branch to next cycle

        Adding in 4 field extractions for the 4 parts of the
        instruction, we get:

                1  memory reference
                3  alu operations
                2  branches
                4  extractions
Taking the assumed weights into account, our simplified instruction mix is:
                14%  memory reference
                25%  alu operation
                20%  branch
                41%  extraction
Given the characteristics described earlier, we estimate the proportions of calls and returns, input/output and other supervisory operations, and manipulations of numbers in other than twos-complement form to be negligible at this level of analysis.

2. Name Spaces

A word is 64 bits wide. A halfword is, therefore, 32 bits wide. There is no special byte size.

Emulator address space is 2^18 halfwords. The address space is cyclic. The (installed) memory space may be less. The memory space is linear. (Addresses above installed memory are invalid.)

The host address space accessed by the emulator is (conceptually) 2^64 bits. The installed memory space is (obviously) much less. Again, the 64-bit memory space is cyclic but the lesser address space is linear.

The emulator working storage consists of:

The decimal carry bits are used by the decimal adjust instructions. Decimal carry bit n represents the carry out of the nth 4-bit substring, counting from the least significant end, during twos-complement addition or subtraction. (Under this definition, carry could also be known as dcarry16.)

In exception mode (i.e., when exc_mode is 1), register 0 acts like all other general registers. When not in exception mode, however, reading register 0 always results in zero, and writing it has no effect.

There is no address space embedding.

There is no backing store. Memory and I/O management are handled by the host machine.

[Addressing is treated in section 4.]

3. Data Formats

Character

No special support for characters or character strings is provided. (The general bit-handling capabilities serve adequately.) No character encoding is assumed.

Logical

Classical logical connectives apply to 64-bit logical vectors. Variable-length field extraction and insertion capabilities facilitate manipulation of logical vectors of sizes from 1 to 64 bits.

Fixed-Point Numbers

64-bit integers use twos-complement notation. Variable-length field extraction and insertion capabilities, plus variable-length integer overflow checking, facilitate manipulation of fixed-point numbers of sizes from 1 to 64 bits.

Unsigned integers represented as strings of 16 binary-coded-decimal digits can be added and subtracted by using decimal adjustment instructions after the normal twos-complement operations.

Facilities for extended-precision integer arithmetic are provided. Carry in and carry out can be controlled on addition and subtraction. Multiplication of two 64-bit integers produces a 128-bit product. Division of a 128-bit dividend by a 64-bit divisor produces a 64-bit quotient and a 64-bit remainder.

Floating-Point Numbers

A floating-point number occupies a register pair and has the following format. (See also Figure 1.)
     0:0        sign (0 for positive, 1 for negative)
     0:1-2      class code
     0:3-31     unused
     0:32-63    exponent (twos-complement integer)
     1:0-63     coefficient (unsigned integer)
The class code has the following meaning.
   0  zero      The value represented is zero.  The coefficient and
                exponent (but not necessarily the sign) are also zero.
   1  normal    The value represented is
                  (-1)^sign * coefficient * 2^(exponent-64)
                That is, the binary point is taken to be at the left
                end of the coefficient.
   2  infinite  The value represented is infinite.  The coefficient
                is zero.  (The sign is meaningful.)
   3  NaN       The value is "not a number".  The coefficient indicates
                information about the exceptional condition.

4. Instruction Formats

All instructions are one halfword (32 bits) in length. There are three basic instruction formats. (See Figure 2.)

Each format starts with an 8-bit opcode. These 8 bits include all opcode information for the instruction.

Following the opcode are one, two, three, or four 6-bit fields, each of which specifies either a general register address or an immediate value, depending on the instruction. In the event that these 6-bit fields do not exhaust the 32-bit instruction word, one final field follows, of length 18, 12, or 6 bits according to the format, that is either unused or an immediate. The number of "addresses" in an instruction can thus range between one and four.

Addressing

Emulator Memory

There are three, simple addressing formats for emulator memory. (See Figure 3.)

Emulator Registers

Unusually, due to the nature of emulation, there are two formats for addressing emulator registers.

Some instructions manipulate 128-bit data items. Such a double-length data item is stored in a register pair. A register pair is a pair of consecutively numbered registers where the lower number is even. A register pair is specified by specifying its lower-numbered element.

Host Memory

Host memory is accessed with a single addressing format. (See Figure 4.) The address is computed as the sum of a 64-bit bit address contained in a register and an unsigned, 6-bit immediate bit displacement. (This format uses instruction format 3.)

There is considerable flexibility in the mapping of emulated programs onto host memory. See the decode_address, encode_address, increment_host_address, and decrement_host_address instructions.

5. Data Flow Diagram

See Figure 5.

6. Operation Repertoire

Given the rather large number of instructions, their descriptions are perhaps briefer than is necessary to convey complete precision. I appeal to the educated reader's understanding of architectural concepts.

Opcode assignments are not given in this document. Because the opcode is encoded in the simplest possible way, no architectural insight would be gained from such assignments.

The instruction name (which maps one-to-one to an opcode) is given first, followed by descriptions of the fields, in order. A field is described by a type separated by a colon from a name. The types are as follows.

The names often convey additional information, as follows.

Here are several more conventions used in the instruction descriptions.

Bit and Field Manipulation

extract_unsigned                 R:dst, R:src, F:fspec
extract_unsigned_imm             R:dst, R:src, W:width, O:offset
Register dst is assigned the zero-extended field of register src.

Note that this instruction subsumes "logical shift right" (except that it cannot shift all 64 bits out of a register). It also can be used to zero-fill the high-order bits of a register.

extract_signed                   R:dst, R:src, F:fspec
extract_signed_imm               R:dst, R:src, W:width, O:offset
Register dst is assigned the sign-extended field of register src.

Note that this instruction subsumes "arithmetic shift right". It also can be used to sign-fill the high-order bits of a register.

make_field                       R:dst, R:src, F:fspec
make_field_imm                   R:dst, R:src, W:width, O:offset
The low-order width bits of register src are shifted left offset bits and the result, with other bits zero, is assigned to register dst.

Note that this instruction subsumes "logical shift left".

replace_field_reg_reg            R:dst, R:src, F:fspec
replace_field_reg_imm            R:dst, R:src, W:width, O:offset
replace_field_imm_reg            R:dst, I6:src, F:fspec
replace_field_imm_imm            R:dst, I6:src, W:width, O:offset
A field is formed from src as for make_field. This field replaces the corresponding field in register dst. When src is immediate, it is treated as signed (MSB-extended).
rotate                           R:dst, R:src, F:fspec
rotate_imm                       R:dst, R:src, U:unused, O:offset
The full contents of register src are rotated right the number of bit positions specified by the offset and the result is assigned to register dst.
count_left_zeros                 R:dst, R:src1, R:src2
count_left_ones                  R:dst, R:src1, R:src2
The number of high-order zeros or ones (depending on the instruction) in register src1 is added to the value of src2 and assigned to register dst.

Logical Operations

These instructions perform the classical logical operations. The "_complement" suffix means that register src2 is complemented before serving as the second operand. All immediates are "signed" (that is, MSB-extended).
and                              R:dst, R:src1, R:src2
and_imm                          R:dst, R:src1, I12:immed
and_complement                   R:dst, R:src1, R:src2
or                               R:dst, R:src1, R:src2
or_imm                           R:dst, R:src1, I12:immed
or_complement                    R:dst, R:src1, R:src2
xor                              R:dst, R:src1, R:src2
xor_imm                          R:dst, R:src1, I12:immed
xor_complement                   R:dst, R:src1, R:src2
Note that logical negation ("not") can be obtained by "xor"ing with all ones.

Literal Generation

Note that register 0 normally holds the value 0. Note also that replace_field_imm_imm can be used to generate literal values.
  
load_positive                    R:dst, I18:const
load_negative                    R:dst, I18:const
These instructions create 18-bit constants with either zero-fill or one-fill in the upper bits.
load_immed_field                 R:dst, O:offset, I12:const
This instruction takes the signed const, shifts it left offset bits, and assigns the resulting value to register dst.

Fixed-Point Arithmetic Operations

These instructions perform the classical arithmetic operations. Immediates are signed unless the instruction is named "_unsigned".
add_imm                          R:dst, R:src1, I12:immed
add                              R:dst, R:src1, R:src2
subtract                         R:dst, R:src1, R:src2
These instructions do not affect any carry or overflow bits.
add_with_carry_out               R:dst, R:src1, R:src2
subtract_with_carry_out          R:dst, R:src1, R:src2
These instructions set all carry and overflow bits appropriately.
add_with_carry_in_and_out        R:dst, R:src1, R:src2
This instruction adds the carry to the sum being formed. It also sets all carry and overflow bits appropriately.
subtract_with_carry_in_and_out   R:dst, R:src1, R:src2
This instruction subtracts the complement of the carry (the "borrow") from the difference being formed. It also sets all carry and overflow bits appropriately.
adjust_decimal_add               R:dst, R:src
It is assumed that the current states of all 16 carries and the value src resulted from an integer addition of two unsigned BCD decimal values. Register dst and carry are set to the values that reflect a true decimal addition of the two BCD decimal values.
adjust_decimal_subtract          R:dst, R:src
It is assumed that the current states of all 16 carries and the value src resulted from an integer subtraction of two unsigned BCD decimal values. Register dst and carry are set to the values that reflect a true decimal subtraction (where a borrow is indicated by the complement of a carry) of the two BCD decimal values.
multiply_unsigned                RR:dst, R:src1, R:src2, R:add
multiply_unsigned_imm            RR:dst, R:src1, I6:src2, R:add
multiply_signed                  RR:dst, R:src1, R:src2, R:add
multiply_signed_imm              RR:dst, R:src1, I6:src2, R:add
The 64-bit multiplicand src1 is multiplied by the 64-bit register or immediate multiplier src2, 64-bit addend add is added, and the 128-bit product is placed in register pair dst. All values are interpreted as unsigned or signed according to the name of the instruction. For the "unsigned" instructions, carry is set according to the carry out of the low 64 bits. For the "signed" instructions, ovflo is set according to the overflow out of the low 64 bits.
divide_unsigned                  R:dst, RR:src1, R:src2, R:rem
divide_unsigned_imm              R:dst, RR:src1, I6:src2, R:rem
divide_signed                    R:dst, RR:src1, R:src2, R:rem
divide_signed_imm                R:dst, RR:src1, I6:src2, R:rem
The 128-bit dividend in register pair src1 is divided by the 64-bit divisor src2. The 64-bit quotient is placed in register dst, and the 64-bit remainder is placed in register rem. Iff the results are not representable, ovflo is set to 1, and the registers updated by the instruction have unpredictable values.
compare                          R:dst, R:src1, R:src2
compare_imm                      R:dst, R:src1, I12:immed
Register src1 is compared to register src2. The results of various comparisons are placed in certain bits of register dst, as follows. (Bits are numbered from the least significant end.) Bits not specified in the accompanying table have undefined values.
       bit          condition         mnemonic

        9 : (unsigned) src1 >= src2     uge
        8 : (unsigned) src1 <  src2     ult
        7 : (unsigned) src1 <= src2     ule
        6 : (unsigned) src1 >  src2     ugt
        5 : (signed)   src1 >= src2     sge
        4 : (signed)   src1 <  src2     slt
        3 : (signed)   src1 <= src2     sle
        2 : (signed)   src1 >  src2     sgt
        1 :            src1 /= src2     ne
        0 :            src1 =  src2     eq
The result is designed for branching with branch_bit_0 and branch_bit_1.

Floating-Point Arithmetic Operations

decode_ieee32                    RR:dst, R:src
decode_ieee64                    RR:dst, R:src
These instructions convert a floating-point value in single- or double-precision IEEE 754 format to a floating-point value in the format of section 3. A double-precision IEEE value occupies the entire register src. A single-precision IEEE value occupies the lower half of register src. The internal-format value occupies register pair dst.
encode_ieee32                    R:dst, RR:src
encode_ieee64                    R:dst, RR:src
These instructions convert a floating-point value in the format of section 3 to a single- or double-precision IEEE 754 format. A double-precision IEEE value occupies the entire register dst. A single-precision IEEE value occupies the lower half of register dst, and the upper half is cleared. The internal-format value occupies register pair src.
add_float                        RR:dst, RR:src1, RR:src2
subtract_float                   RR:dst, RR:src1, RR:src2
multiply_float                   RR:dst, RR:src1, RR:src2
divide_float                     RR:dst, RR:src1, RR:src2
These instructions perform the identified floating-point dyadic arithmetic operations on values in internal format in register pairs src1 and src2, producing a result in register pair dst. IEEE 754 semantics govern, except that no rounding is performed.
round_float                      RR:val, R:info, D12:disp
If the floating-point value in internal format in register pair val has class "normal", it is rounded and placed back in the register pair. Rounding is performed according the specifications in info, whose meaning is as follows. (See also Figure 6.)
      bits      name      meaning
       0-7       --       unused (should be zero)
       8-9      mode      rounding mode:  0  toward nearest
                                          1  toward zero
                                          2  toward -infinity
                                          3  toward +infinity
      10-15     length    number of bits of coefficient to retain
                          (0 means to retain all 64 bits)
      16-39     expmin    minimum exponent for normal class
      40-63     expmax    maximum exponent for normal class
If the value has class "infinite" or "NaN", or if the value has class "normal" and the exponent is outside the specified exponent range, a branch is made according to disp.

Register Indexing

load_register                    R:dst, R:index, I6:disp
The contents of the register whose address is the 6-bit modulus sum of disp and the contents of index is copied to register dst.
store_register                   R:src, R:index, I6:disp
The contents of register src is copied to the register whose address is the 6-bit modulus sum of disp and the contents of index.

Emulator Memory Operations

These instructions perform classical load, store, and address arithmetic operations. Instructions whose names end in "_32" load a 32-bit word into the low 32 bits of the register and clear the high 32 bits.
load_direct_32                   R:dst, I18:disp
load_direct_64                   R:dst, I18:disp
load_indexed_32                  R:dst, R:index, I12:disp
load_indexed_64                  R:dst, R:index, I12:disp
store_direct_32                  R:src, I18:disp
store_direct_64                  R:src, I18:disp
store_indexed_32                 R:src, R:index, I12:disp
store_indexed_64                 R:src, R:index, I12:disp

Host Memory Operations

host_load                        R:contents, R:address, I6:length, I6:disp
host_store                       R:contents, R:address, I6:length, I6:disp
These instructions load or store length bits (where 0 means 64) at the bit address in host memory equal to the (64-bit modulus) sum of address and unsigned disp.

The host memory interface is designed to allow for flexible mapping of emulated programs onto the host's presumed power-of-two-sized memory. Where emulated memory is power-of-two-sized (as for the IBM 360, for example), the mapping is direct and simple. In other cases, such as for a 36-bit machine such as the IBM 704, two distinct approaches can be used.

Note that the general model embodied here in these host access instructions does not place any necessary burden on a host. The emulator hardware will generate a sequence of host memory accesses--appropriate to the host and wrapped with appropriate shifting--to accomplish the semantics described here. In particular, this semantic model can be implemented even for hosts that prohibit unaligned access or provide fewer than 64 bits for each access.

decode_address                   R:dst, R:src, I6:upw, I6:ulen
The value in register src is divided by unsigned upw (units per 64-bit word). (A zero value for upw means 64.) The 58 low bits of quotient are concatenated to the 6-bit product of the remainder and unsigned ulen (the unit length) and the result is placed in register dst. (A zero value for ulen means 64.) This result is the proper bit address for the designated emulated memory unit in host memory. The number of low-order bits of each 64 bits of host memory that are "wasted" is 64 - upw * ulen. It must be that upw * ulen <= 64.
encode_address                   R:dst, R:src, I6:upw, I6:ulen
The product of unsigned upw and the high-order 58 bits of register src are added to the quotient of the low-order 6 bits of src and unsigned ulen and placed in register dst.
increment_host_address           R:dst, R:src, I6:incr, I6:limit
Unsigned incr is added to the low-order 6 bits of the host (bit) address in register src. (A zero value for incr means 64.) If the result equals or exceeds unsigned limit, the low 6 bits are set to zero and the high-order 58 bits are incremented.
decrement_host_address           R:dst, R:src, I6:incr, I6:limit
Unsigned incr is subtracted from the low-order 6 bits of the host (bit) address in register src. (A zero value for incr means 64.) If the result is negative, the low 6 bits are set to unsigned limit - incr and the high-order 58 bits are decremented.

The emulator operates in Big Endian mode. However, in order to ease emulation of Little Endian architectures (albeit at some performance cost), the following instructions are provided.

swap_bytes_2                     R:dst, R:src
swap_bytes_4                     R:dst, R:src
swap_bytes_8                     R:dst, R:src
These instructions swap bytes to account for Little Endian addressing. swap_bytes_2 is for PDP11-style addressing, for data of size 16, 32, and 64. swap_bytes_4 is for 4-byte VAX quantities. swap_bytes_8 is for 8-byte VAX quantities.

Instruction Sequencing

branch_cond                      I6:cond, R:src, D12:disp12
Branch if the condition described by cond is met for register src. The condition is met if any of the following is true:
        cond bit n is 1   and   the value of src is
        ------------------------------------------------
                3               maximum negative
                2               less than zero but not maximum negative
                1               equal to zero
                0               greater than zero
The following mnemonics are assigned to the indicated combinations of the above conditions.
        3 2 1 0    mnemonic

        0 0 0 0
        0 0 0 1      gt0
        0 0 1 0      eq0
        0 0 1 1      ge0
        0 1 0 0
        0 1 0 1      
        0 1 1 0
        0 1 1 1
        1 0 0 0
        1 0 0 1
        1 0 1 0
        1 0 1 1
        1 1 0 0      lt0
        1 1 0 1      ne0
        1 1 1 0      le0
        1 1 1 1
branch_bit_0                     I6:bit, R:src, D12:disp12
branch_bit_1                     I6:bit, R:src, D12:disp12
Branch if bit bit of register src is zero/one. This instruction is especially useful with the result of compare.
branch_direct                    R:link, I18:disp
branch_indexed                   R:link, R:index, I12:disp
Branch unconditionally and place the address of the following instruction into register link. link usually specifies register 0 when the return address is not needed.
dispatch                         I6:log, R:src, W:width, O:offset
Branch unconditionally to the address which is the (18-bit modulus) sum of the address of the next instruction and the result of shifting the designated field of register src left unsigned log bits. log must be less than 8.

That is, a dispatch instruction is followed, in line, by a table of instructions. There are 2^width entries to the table, corresponding to the 2^width possible values for the specified field. These entries occur every 2^log instructions, starting with the first instruction following the dispatch.

bump_branch_<rel>                R:index, R:addend, R:limit, D6:disp
bump_branch_<rel>_imm            R:index, I6:addend, I6:limit, D6:disp
Augment register index by signed addend. Branch if the value of register index before its augmentation is in the designated relation to signed limit. <rel> can be lt, le, gt, ge, eq, or ne.
branch_on_signed_overflow        I6:len, R:src, D12:disp
branch_on_unsigned_overflow      I6:len, R:src, D12:disp
Branch if the value in register src is outside the range of (unsigned) len-bit signed/unsigned values. A zero value for len means 64.

When len is 0 (meaning 64), the result depends on the value of ovflo or carry, for signed and unsigned cases, respectively. Otherwise, the content of src determines the result. For the signed case, a branch results if all bits to the left of the most significant bit don't match the sign bit. For the unsigned case, a branch results if all bits to the left of the most significant bit are not zero.

Supervision

return_from_exception            R:addr
Branch unconditionally to the address in addr and turn exception mode off.
signal_host                      R:request
Request service request from the host computer. See section 8, "Supervision", for descriptions of the defined requests.

7. Instruction Sequencing

See the "Instruction Sequencing" subsection in section 6.

8. Supervision

When certain unexpected conditions arise (such as an invalid host memory access), an exception is raised in the emulator. Exception mode is entered. A code identifying the type of exception and the current instruction address are placed in the upper and lower halves, respectively, of register 0 (which is normally not writable). Control is then passed to emulator memory location 0. The program beginning there can interrogate register zero and act on the exception appropriately.

Because the exception mechanism serves most importantly as an interruption mechanism by which the host can communicate with the emulator, full description of the exception mechanism is deferred to the next section.

9. Host Communication

The emulator and host computer communicate with each other using an interruption scheme. An interruption from the host computer causes the emulator to be diverted from its normal instruction execution. Such an interruption in the emulator is called an "exception", the consequences of which were described in the previous section.

Defined Exceptions

The following exceptions are defined.

Signals

The emulator can also interrupt the host computer. From the emulator's perspective (the perspective of this document), such an interruption is called a "signal". A signal is sent to the host with the signal_host instruction (see section 6).

The following signals are defined.

More specifics of the exception and signaling mechanisms can be seen in the sketch of the IBM 650 emulator in Appendix 4.

Initiation, Termination, Input/Output, Debugging

A program is initiated on the emulator by the host program. The host program loads host memory with a program to be emulated, loads the emulator program and control information into the emulator memory, then raises start_exception in the emulator. The emulator program responds by beginning emulation of the program.

During emulated program execution, I/O requests recognized during emulated instruction execution (such as an RD instruction on the 650) are satisfied by requests sent to the host with signal_host instructions. Upon completion of such a service request, the host computer raises a service_completion_exception in the emulator. See the treatment of the RD and PCH instructions in Appendix 4 for examples. The specifics of such interactions depend on the computer being emulated and the host computer.

When the emulated program has terminated normally, the emulator signals this to the host computer with a service request that indicates emulated program termination.

The emulator includes two registers not part of the programming model, an instruction counter and an instruction limit. These registers are controlled by the host program to provide a rudimentary debugging facility for emulator programs. The instruction counter simply advances by one for each emulator instruction executed. When the instruction counter matches the instruction limit, an exception is raised, and the emulator's exception handler saves the emulator state (in memory visible to the host), signals the host, and waits for further instruction. The host can examine the emulator's state, modify it (including the value of the instruction limit register), and resume execution.

10. Design Rationale

Major Decisions

Three design decisions most strongly shaped the design of the universal emulator.

power-of-two sizes

Use of power-of-two sizes for data units is important for two reasons. First, it makes for a good match to today's hosts and, in particular, their memory sizes. Power-of-two size compatibility allows the host memory address abstraction described under "Host Memory Operations" in section 6 to work relatively smoothly. Second, in the context of the emulator itself, power-of-two sizes lead to greater bit efficiency in specifying fields, a very common occurrence. That is, because the emulator's registers are a power of two in size, instruction bandwidth is not wasted when specifying a bit position in a register.

data width

In order for a universal emulator to emulate a wide variety of architectures successfully, it must have great enough capacity to "cover" most architectures that might be emulated. Architectures have varied widely in their data widths, spanning at least a range from 12-bit to 64-bit words. Although there may be some with data widths greater than 64 bits, they are quite rare. Coupled with the fact that 64-bit data paths have become common in today's technology, the choice for a 64-bit data width for the emulator is very clear. It allows the emulator to handle nearly all architectural widths smoothly, yet does not overtax the current technology.

amount of working store

The third major design decision also has to do with capacity, the size of working storage. For the emulator to be effective in its emulation performance, it seems obvious that it must hold the emulated machine's working storage in register, rather than memory, storage. So it is desirable for the emulator to have lots of registers.

Typical sizes for working stores were examined during the initial investigation. (See section 1, "Application Characteristics".) It was concluded there that most architectures have fewer than around 200 bytes or so of working storage. Assuming 64-bit registers, these register set sizes yield these amounts of working storage:

        number of       working storage
        registers          in bytes

           32                256
           64                512
          128               1024
          256               2048
So it is seen that 32 registers are barely enough, while 256 registers appear to be plenty. The actual number chosen, 64, resulted from secondary issues that are addressed next.

Instruction Format

Having decided on power-of-two sizes and having chosen a data width (64 bits), the next set of important and related decisions dealt with the instruction format. Again there were three key desiderata.

Narrow Instructions

A major decision had to be made between wide, microcode-like instructions, with which much simultaneous control can be exercised, and narrow, RISC-like instructions which are simpler and potentially faster. Given the desire to code emulator programs in a high-level language, the simpler, narrow instructions enjoy a decided advantage. Hence, I opted for narrow instructions for "compilability" and performance.

Three-Operand Capability

The convergent architecture we see today generally provides the option for designating three operands in an instruction. This suits two-operand arithmetic and logical operations very well, as the need to copy register values in order to avoid losing one of the arguments to the operation is much reduced. Coupled with the specification of these three operands in registers, not only is copying reduced, but also the instructions remain reasonably narrow and highly bit-efficient. A secondary reason concerning compilation comes into play here, as well: Compiling for 3-operand formats gives a compiler more flexibility and less trouble in generating good-quality code.

Support for Field Extraction

As the analysis in section 1 shows, field extraction is likely the single most common operation in emulation. This calls for dedicating proportionately greater resources to this operation: instruction space as well as emulator hardware. This led me to adapt the field manipulation model implemented by the Motorola MC88000 microprocessor. At affordable cost in hardware and speed (apparently, given the functioning of the MC88000 processor), a rather powerful field specification model is supported. (See the descriptions of the "Bit and Field Manipulation" instructions in section 6 for details.)

Adopting that model has two strong influences on instruction set design. First, in order to maintain the separation of result register from operand registers, the field manipulation instructions must (at least sometimes) specify four operands! Fortunately, two of these operands are short: of width equal to the ceiling of the log of the number of bits in a register. Of course, with a power-of-two number of registers, the ceiling function makes no difference, and instruction bandwidth is not wasted. Just as importantly, it becomes clear that it is quite desirable that the number of registers be equal to the number of bits in a register, so that the same field of an instruction can comfortably specify either a register address or a bit position within a register.

In the MC88000, this common number was 32. For the emulator, we have already seen that 32 registers would barely be enough. Considering greater numbers, we see that with 64 registers, four operands would fit in a 32-bit instruction, leaving 8 bits for opcode. This allocation seems rather serendipitous, but we'd still prefer more registers. Could 128 registers be made to work? Unfortunately not, as four 7-bit operand specifications would leave a paltry 4 bits for opcode information. So, 64 64-bit registers seems quite an acceptable balance point.

The final step in shaping the instruction format is to order the fields simply left to right and realize that not all instructions need to specify four operands. It is quite natural to let the final needed field "soak up" the remainder of the instruction word's 32 bits, especially when that field serves as an immediate value, typically a shorthand for a more general specification using registers.

Data Format Support

Which data types need what levels of support from the emulator?

Fixed-Point

Fixed-point is clearly necessary. Twos-complement format has become standard, but historical machines sometimes used ones-complement or signed-magnitude formats. Twos-complement format is appropriate for an emulator, not just because of its popularity, but because it is the only format among the three in which all values of the other formats can be represented. Hence, operations in the other formats can be emulated by first converting to twos-complement format, performing the operation in twos-complement form, then converting back to the other format. The only trouble with this approach is that occasionally a negative zero value is required, but this situation is rare.

We choose not to provide conversion instructions among fixed-point formats, because the conversions are simple using already existing instructions (see Appendix 1), and because we don't feel that the frequency of ones-complement and signed-magnitude formats justifies additional instructional support.

Many machines provide multiple-precision fixed-point arithmetic. In addition, there is some chance that we will want to emulate a machine that manipulates fixed-point values larger than 64 bits wide. For both of these reasons, we provide multiple-precision support. (See add_with_carry_out, subtract_with_carry_out, add_with_carry_in_and_out, and subtract_with_carry_in_and_out under "Fixed-Point Arithmetic Operations" in section 6.) Also, multiplications produce a double-length product.

Floating-Point

The case of floating-point is more difficult to deal with. Historically, floating-point formats have shown much greater diversity than fixed-point formats. Nevertheless, many features are common: radix two, signed-magnitude coefficient form, biased exponent. And, as time goes on, the commonality will only increase, due to the widespread and increasing acceptance of the IEEE 754 standard.

Given this situation, one approach would be simply to adopt the IEEE format wholesale. The convergence to the IEEE standard among current architectures would mean that, over time, the emulator's format would be a perfect fit for more and more possible emulation targets. Moreover, in emulating a machine with a different format, one could still use IEEE-format facilities if one were willing to accept less-than-perfect emulation with respect to floating-point precision.

While this approach is reasonable, we prefer not to penalize so sharply those emulation targets whose floating-point format is not IEEE-compliant. Instead, we define a new (and admittedly peculiar) internal floating-point format with two key properties:

Decimal

Decimal types also come in abundant diversity. Fortunately, decimal types are not nearly so important--due to their lower frequency--as floating-point types. We include assist for unsigned decimal addition and subtraction (see adjust_decimal_add and adjust_decimal_subtract under "Fixed-Point Arithmetic Operations" in section 6) because it is relatively inexpensive and because it goes a long way toward easing the burden of emulating decimally oriented machines. (See the sketch of the emulation of the IBM 650 in Appendix 4. As can be seen there, conversion between decimal and fixed-point are only moderately expensive and are certainly tolerable in the emulation of a very old--and slow--machine such as the 650.)

Minor Decisions

This subsection comments briefly on several smaller design decisions.

Handling Diverse Unit Sizes

Architectures have used many different unit sizes through the years. Supporting them smoothly is a real challenge. We choose to provide the flexibility described under "Host Memory Operations" in section 6 to allow the strategy to be tailored to the architecture under emulation. Note that, where a good match exists to the host memory structure, there is little overhead imposed by this general model.

It could be questioned whether trying to handle the complete set of word sizes (from 6 to 95) listed in section 1 is worthwhile. The primary features that facilitate handling various word sizes are the bit and field manipulation instructions. The main reason these instructions are defined so richly is to facilitate decoding of instructions. Once these instructions are present, there is little additional cost (beyond the host memory model mentioned above) to provide graceful manipulation of various word sizes.

Backwards Bit Numbering Within a Register

At first glance, one might wonder why the bits within a register are numbered from right to left, given that the emulator is otherwise a Big Endian machine. This numbering scheme fits better with the field manipulation instructions, and in particular with the specification of the offset attribute. With the backwards bit numbering, the offset values correspond nicely to shift counts, where the extract_unsigned, extract_signed, and make_field instructions take on the roles of "logical right shift", "arithmetic right shift", and "logical left shift" instructions.

Conditional Branch

The specification of the condition on which a conditional branch is based is somewhat unusual. (See the branch_cond instruction in section 6.) I borrowed it directly from the MC88000 architecture, but it perhaps deserves a bit of commentary nonetheless.

The condition specification is essentially "microcoded". Two independent conditions are evaluated: the state of the sign bit of the register and the "zero-ness" of all the other bits in the register. These two conditions give rise to four possibilities, corresponding to the four bits in the specification. I have always imagined that, by encoding the choices this way, the number of logic levels was minimized. This is important in a branch instruction because it is important to decide the direction of the branch as soon as possible. Note that the other conditional branch instructions (branch_bit_0 and branch_bit_1) have even simpler conditions. Put another way, when two comparands are to be compared, the comparison is separated from the branch, ensuring plenty of time to decide the branch direction. Thus, manifesting this low-level specification of the branch condition allows one to keep the cycle time minimal.

With the same goal of reducing cycle time in mind, I chose to specify that the comparison in the bump_branch_... instructions be performed on the value of the index register before its augmentation. This removes the time for an addition from the critical path. Whether this reduction would actually matter is not known. Even after this simplification, the bump_branch_... instructions are significantly more complex than branch_cond, for example.

Sequencing

Applying the RISC philosophy, sequencing is kept as simple as possible. We do recognize the frequency and hence importance of decoding, so we include the dispatch instruction for this purpose. (See "Instruction Sequencing" in section 6.) This instruction implements a "select case" construct.

No Explicit Stack Addressing

We have not included any explicit support for stack addressing because the need for it has not appeared strong in the code sketches we have investigated.

11. Sample Code

In Appendix 2 we give a brief description of the assembly language conventions followed in Appendices 3 and 4.

We show an outline for an emulator for the IBM 360 in Appendix 3. The sketch shows the basic instruction fetch cycle, complete with program error and interrupt checking, including an instruction for each of the RR and RX formats.

In Appendix 4 we give a much more extended sketch of an emulator for the IBM 650. Whereas the IBM 360 matches the emulator in several key respects (including powers-of-two sizes and twos-complement integers), the IBM 650 differs much more markedly, being fully decimal with a word size of 11 decimal digits. The sketch for the 650 also shows much more explicitly how the emulator and host machines interact.

12. Acknowledgements

I would like to acknowledge the MC88000 architecture and its designers (whose names are unknown to me) for the field manipulation and conditional branching models that inspired these areas of the design of the emulator.