The emulation machine (also called simply the "emulator") has working storage and memory, much like a general-purpose computer. These resources house the emulation program's code and data, and thereby also the application program's working storage. The code and other data of the application program, however, are stored in host memory. It is assumed that the emulator can efficiently read and write host memory. The emulator also relies on the host for supervisory functions such as input/output.
To initiate an emulation, the host downloads an initial state image to the emulator's memory and working storage. Included in this state image is information on the allocation of host memory for use by the application being emulated. After the state image is loaded, execution begins at a predetermined location in emulator memory. The emulator interrupts the host when it requires supervisory service or when it has completed execution of the application.
First, it must be noted that, because we desire to emulate general-purpose computer architectures, we can't define very narrowly the characteristics of the application programs that will be run. Fortunately, application programs of widely varying kinds show great similarity among themselves. We can therefore afford to concentrate on the characteristics of the emulation programs that will run on the emulator engine.
The characteristics of the emulation programs flow directly from the architectures they represent. Again fortunately, computer architectures tend to be more similar than different. We can derive the following characteristics for the emulation programs.
section   machine        approx. num. bits
10.5      Univac I       960
11.4      IBM 704        160
12.1      IBM 650        220
12.4      IBM 360        960
13.3      IBM Stretch    2100
14.1      Univac 1103A   140
14.2      CDC 6600       1056
14.4      Cray 1         40000
15.1      PDP 8          70
15.2      PDP 11         480
15.3      VAX            768
16.1      Intel 8080A    100
16.4      MC68000        640
Most of these machines require well below 200 bytes of working store.
The Stretch requires somewhat more, and the Cray 1 much more.
Let us start with the emulation of a conventional modern (RISC) processor. Through a subjective combination of the data from Figures 1.17, 2.11, and 2.26 in Patterson and Hennessy, we arrive at a mix of instruction classes something like this:
35% load/store
50% alu operations (mostly simple)
15% control operations (mostly conditional branch)
Let us now consider the operations involved in emulating a typical
instruction from each of these classes.
load rd,rx,disp
load instruction
bump instruction address
dispatch on opcode
look up index register
add index and displacement
load/store data
branch to next cycle
Adding in 4 field extractions for the 4 parts of the
instruction, we get:
2 memory references
3 alu operations
2 branches
4 extractions
alu rd,rs,rt
load instruction
bump instruction address
dispatch on opcode
operate
branch to next cycle
Adding in 4 field extractions for the 4 parts of the
instruction, we get:
1 memory reference
2 alu operations
2 branches
4 extractions
cbr rs,rt,disp
load instruction
bump instruction address
dispatch on opcode
operate
conditionally update instruction address
branch to next cycle
Adding in 4 field extractions for the 4 parts of the
instruction, we get:
1 memory reference
3 alu operations
2 branches
4 extractions
Taking the assumed weights into account, our simplified instruction
mix is:
14% memory reference
25% alu operation
20% branch
41% extraction
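The arithmetic behind this mix can be reproduced with a short script. This is only a sketch: the names WEIGHTS, COSTS, and emulator_mix are ours, and the numbers are the class weights and per-class tallies given above.

```python
# Assumed weights for the three instruction classes (from the mix above).
WEIGHTS = {"load/store": 0.35, "alu": 0.50, "control": 0.15}

# Emulator operations per emulated instruction, as tallied above:
# (memory references, alu operations, branches, extractions)
COSTS = {
    "load/store": (2, 3, 2, 4),
    "alu":        (1, 2, 2, 4),
    "control":    (1, 3, 2, 4),
}

KINDS = ("memory reference", "alu operation", "branch", "extraction")


def emulator_mix():
    """Return the weighted share (in percent) of each emulator operation kind."""
    totals = [
        sum(WEIGHTS[cls] * COSTS[cls][i] for cls in WEIGHTS)
        for i in range(len(KINDS))
    ]
    grand_total = sum(totals)
    return {kind: 100.0 * t / grand_total for kind, t in zip(KINDS, totals)}


if __name__ == "__main__":
    for kind, pct in emulator_mix().items():
        print(f"{pct:3.0f}% {kind}")
```

The weighted totals are 1.35 memory references, 2.5 alu operations, 2 branches, and 4 extractions per emulated instruction, or 9.85 emulator operations in all.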
Given the characteristics described earlier, we estimate the
proportions of calls and returns, input/output and other supervisory
operations, and manipulations of numbers in other than twos-complement
form to be negligible at this level of analysis.
Emulator address space is 2^18 halfwords. The address space is cyclic. The (installed) memory space may be less. The memory space is linear. (Addresses above installed memory are invalid.)
The host address space accessed by the emulator is (conceptually) 2^64 bits. The installed memory space is (obviously) much less. Again, the 64-bit address space is cyclic but the smaller memory space is linear.
The emulator working storage consists of:
In exception mode (i.e., when exc_mode is 1), register 0 acts like all other general registers. When not in exception mode, however, reading register 0 always results in zero, and writing it has no effect.
There is no address space embedding.
There is no backing store. Memory and I/O management are handled by the host machine.
[Addressing is treated in section 4.]
Unsigned integers represented as strings of 16 binary-coded-decimal digits can be added and subtracted by using decimal adjustment instructions after the normal twos-complement operations.
Facilities for extended-precision integer arithmetic are provided. Carry in and carry out can be controlled on addition and subtraction. Multiplication of two 64-bit integers produces a 128-bit product. Division of a 128-bit dividend by a 64-bit divisor produces a 64-bit quotient and a 64-bit remainder.
0:0 sign (0 for positive, 1 for negative)
0:1-2 class code
0:3-31 unused
0:32-63 exponent (twos-complement integer)
1:0-63 coefficient (unsigned integer)
The class code has the following meaning.
0 zero The value represented is zero. The coefficient and
exponent (but not necessarily the sign) are also zero.
1 normal The value represented is
(-1)^sign * coefficient * 2^(exponent-64)
That is, the binary point is taken to be at the left
end of the coefficient.
2 infinite The value represented is infinite. The coefficient
is zero. (The sign is meaningful.)
3 NaN The value is "not a number". The coefficient indicates
information about the exceptional condition.
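As an illustration of the class codes, the following sketch computes the value denoted by an internal-format number. It takes the four fields already separated, so it makes no assumption about bit numbering within the register pair; the function name internal_value and the use of Python's Fraction type are ours.

```python
from fractions import Fraction

ZERO, NORMAL, INFINITE, NAN = 0, 1, 2, 3  # class codes from the table above


def internal_value(sign, cls, exponent, coefficient):
    """Return the value of an internal-format number.

    Zero and normal values come back as an exact Fraction, infinities as
    float +/-inf, and NaN as float nan.
    """
    if cls == ZERO:
        return Fraction(0)
    if cls == INFINITE:
        return float("-inf") if sign else float("inf")
    if cls == NAN:
        return float("nan")
    # Normal: the binary point sits at the left end of the 64-bit coefficient.
    magnitude = Fraction(coefficient) * Fraction(2) ** (exponent - 64)
    return -magnitude if sign else magnitude
```

For example, a coefficient with only its top bit set and an exponent of 1 denotes the value 1.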
Each format starts with an 8-bit opcode. These 8 bits include all opcode information for the instruction.
Following the opcode are one, two, three, or four 6-bit fields, each of which specifies either a general register address or an immediate value, depending on the instruction. In the event that these 6-bit fields do not exhaust the 32-bit instruction word, one final field follows, of length 18, 12, or 6 bits according to the format, that is either unused or an immediate. The number of "addresses" in an instruction can thus range between one and four.
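The format rule can be sketched as a small field splitter. The function name split_instruction is hypothetical, and the sketch assumes that the opcode occupies the most significant 8 bits with the remaining fields following left to right; the text here does not fix the bit order.

```python
def split_instruction(word, n_short):
    """Split a 32-bit instruction word into (opcode, short_fields, final_field).

    n_short is the number of 6-bit fields (1 to 4) in the format.  The final
    field takes whatever remains: 18, 12, 6, or 0 bits.
    ASSUMPTION: opcode in the top 8 bits, fields following left to right.
    """
    if not 1 <= n_short <= 4:
        raise ValueError("a format has one to four 6-bit fields")
    word &= 0xFFFFFFFF
    opcode = word >> 24
    shorts = [(word >> (18 - 6 * i)) & 0x3F for i in range(n_short)]
    final_width = 24 - 6 * n_short
    final = word & ((1 << final_width) - 1)
    return opcode, shorts, final
```

With two 6-bit fields the final field is 12 bits wide; with four, nothing is left over and the final field is empty.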
Some instructions manipulate 128-bit data items. Such a double-length data item is stored in a register pair. A register pair is a pair of consecutively numbered registers where the lower number is even. A register pair is specified by specifying its lower-numbered element.
There is considerable flexibility in the mapping of emulated programs onto host memory. See the decode_address, encode_address, increment_host_address, and decrement_host_address instructions.
Opcode assignments are not given in this document. Because the opcode is encoded in the simplest possible way, no architectural insight would be gained from such assignments.
The instruction name (which maps one-to-one to an opcode) is given first, followed by descriptions of the fields, in order. A field is described by a type separated by a colon from a name. The types are as follows.
The names often convey additional information, as follows.
Here are several more conventions used in the instruction descriptions.
extract_unsigned R:dst, R:src, F:fspec
extract_unsigned_imm R:dst, R:src, W:width, O:offset
Register dst is assigned the zero-extended field of register src.
Note that this instruction subsumes "logical shift right" (except that it cannot shift all 64 bits out of a register). It also can be used to zero-fill the high-order bits of a register.
extract_signed R:dst, R:src, F:fspec
extract_signed_imm R:dst, R:src, W:width, O:offset
Register dst is assigned the sign-extended field of register src.
Note that this instruction subsumes "arithmetic shift right". It also can be used to sign-fill the high-order bits of a register.
make_field R:dst, R:src, F:fspec
make_field_imm R:dst, R:src, W:width, O:offset
The low-order width bits of register src are shifted left offset bits and the result, with other bits zero, is assigned to register dst.
Note that this instruction subsumes "logical shift left".
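The three field instructions can be modeled in a few lines. This is a sketch under two assumptions of ours: that offset counts bit positions from the least significant end (as the make_field description suggests), and that width lies between 1 and 64.

```python
MASK64 = (1 << 64) - 1


def extract_unsigned(src, width, offset):
    """Zero-extended field of `width` bits starting `offset` bits from the LSB."""
    return (src >> offset) & ((1 << width) - 1)


def extract_signed(src, width, offset):
    """Sign-extended field, returned as a 64-bit twos-complement pattern."""
    field = extract_unsigned(src, width, offset)
    if field >> (width - 1):          # top bit of the field set: fill with ones
        field -= 1 << width
    return field & MASK64


def make_field(src, width, offset):
    """Low `width` bits of src shifted left by offset, other bits zero."""
    return ((src & ((1 << width) - 1)) << offset) & MASK64
```

Under these assumptions, a shift by n positions is the special case with width 64 - n and offset n, which is why a shift of all 64 bits cannot be expressed.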
replace_field_reg_reg R:dst, R:src, F:fspec
replace_field_reg_imm R:dst, R:src, W:width, O:offset
replace_field_imm_reg R:dst, I6:src, F:fspec
replace_field_imm_imm R:dst, I6:src, W:width, O:offset
A field is formed from src as for make_field. This field replaces the corresponding field in register dst. When src is immediate, it is treated as signed (MSB-extended).
rotate R:dst, R:src, F:fspec
rotate_imm R:dst, R:src, U:unused, O:offset
The full contents of register src are rotated right the number of bit positions specified by the offset and the result is assigned to register dst.
count_left_zeros R:dst, R:src1, R:src2
count_left_ones R:dst, R:src1, R:src2
The number of high-order zeros or ones (depending on the instruction) in register src1 is added to the value of src2 and assigned to register dst.
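A minimal model of these three operations follows. It assumes, as our own reading, that offset counts from the least significant end and that the sum in count_left_zeros wraps at 64 bits; neither point is stated explicitly here.

```python
MASK64 = (1 << 64) - 1


def replace_field(dst, src, width, offset):
    """Overwrite the (width, offset) field of dst with the low bits of src."""
    mask = ((1 << width) - 1) << offset
    return ((dst & ~mask) | ((src << offset) & mask)) & MASK64


def rotate(src, offset):
    """Rotate the 64-bit value src right by offset bit positions."""
    src &= MASK64
    offset %= 64
    return ((src >> offset) | (src << (64 - offset))) & MASK64


def count_left_zeros(src1, src2):
    """Number of high-order zero bits in src1, plus src2 (assumed to wrap)."""
    return (64 - (src1 & MASK64).bit_length() + src2) & MASK64
```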
and R:dst, R:src1, R:src2
and_imm R:dst, R:src1, I12:immed
and_complement R:dst, R:src1, R:src2
or R:dst, R:src1, R:src2
or_imm R:dst, R:src1, I12:immed
or_complement R:dst, R:src1, R:src2
xor R:dst, R:src1, R:src2
xor_imm R:dst, R:src1, I12:immed
xor_complement R:dst, R:src1, R:src2
Note that logical negation ("not") can be obtained by "xor"ing with all ones.
load_positive R:dst, I18:const
load_negative R:dst, I18:const
These instructions create 18-bit constants with either zero-fill or one-fill in the upper bits.
load_immed_field R:dst, O:offset, I12:const
This instruction takes the signed const, shifts it left offset bits, and assigns the resulting value to register dst.
add_imm R:dst, R:src1, I12:immed
add R:dst, R:src1, R:src2
subtract R:dst, R:src1, R:src2
These instructions do not affect any carry or overflow bits.
add_with_carry_out R:dst, R:src1, R:src2
subtract_with_carry_out R:dst, R:src1, R:src2
These instructions set all carry and overflow bits appropriately.
add_with_carry_in_and_out R:dst, R:src1, R:src2
This instruction adds the carry to the sum being formed. It also sets all carry and overflow bits appropriately.
subtract_with_carry_in_and_out R:dst, R:src1, R:src2
This instruction subtracts the complement of the carry (the "borrow") from the difference being formed. It also sets all carry and overflow bits appropriately.
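To show how the carry instructions chain, here is a sketch of a 128-bit addition built from two 64-bit steps. It models only the single carry out of bit 63; the per-digit carries and the overflow bit are omitted, and the helper add128 is our own name.

```python
MASK64 = (1 << 64) - 1


def add_with_carry_out(a, b):
    """64-bit add; returns (sum, carry)."""
    total = a + b
    return total & MASK64, total >> 64


def add_with_carry_in_and_out(a, b, carry):
    """64-bit add that also adds the incoming carry; returns (sum, carry)."""
    total = a + b + carry
    return total & MASK64, total >> 64


def subtract_with_carry_in_and_out(a, b, carry):
    """64-bit subtract of b and the borrow (the complement of carry)."""
    total = a - b - (1 - carry)
    return total & MASK64, 0 if total < 0 else 1


def add128(a_hi, a_lo, b_hi, b_lo):
    """Add two 128-bit values held as (high, low) 64-bit halves."""
    lo, carry = add_with_carry_out(a_lo, b_lo)
    hi, carry = add_with_carry_in_and_out(a_hi, b_hi, carry)
    return hi, lo, carry
```

Longer operands follow the same pattern: one add_with_carry_out for the lowest word, then add_with_carry_in_and_out for each higher word.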
adjust_decimal_add R:dst, R:src
It is assumed that the current states of all 16 carries and the value src resulted from an integer addition of two unsigned BCD decimal values. Register dst and carry are set to the values that reflect a true decimal addition of the two BCD decimal values.
adjust_decimal_subtract R:dst, R:src
It is assumed that the current states of all 16 carries and the value src resulted from an integer subtraction of two unsigned BCD decimal values. Register dst and carry are set to the values that reflect a true decimal subtraction (where a borrow is indicated by the complement of a carry) of the two BCD decimal values.
multiply_unsigned RR:dst, R:src1, R:src2, R:add
multiply_unsigned_imm RR:dst, R:src1, I6:src2, R:add
multiply_signed RR:dst, R:src1, R:src2, R:add
multiply_signed_imm RR:dst, R:src1, I6:src2, R:add
The 64-bit multiplicand src1 is multiplied by the 64-bit register or immediate multiplier src2, 64-bit addend add is added, and the 128-bit product is placed in register pair dst. All values are interpreted as unsigned or signed according to the name of the instruction. For the "unsigned" instructions, carry is set according to the carry out of the low 64 bits. For the "signed" instructions, ovflo is set according to the overflow out of the low 64 bits.
divide_unsigned R:dst, RR:src1, R:src2, R:rem
divide_unsigned_imm R:dst, RR:src1, I6:src2, R:rem
divide_signed R:dst, RR:src1, R:src2, R:rem
divide_signed_imm R:dst, RR:src1, I6:src2, R:rem
The 128-bit dividend in register pair src1 is divided by the 64-bit divisor src2. The 64-bit quotient is placed in register dst, and the 64-bit remainder is placed in register rem. Iff the results are not representable, ovflo is set to 1, and the registers updated by the instruction have unpredictable values.
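The representability condition for unsigned division can be sketched as follows. The sketch takes the two halves of the dividend as separate arguments, since which register of the pair holds the high half is not stated here, and it treats a zero divisor as a non-representable result, which is our assumption.

```python
MASK64 = (1 << 64) - 1


def divide_unsigned(dividend_hi, dividend_lo, divisor):
    """Divide a 128-bit dividend by a 64-bit divisor.

    Returns (quotient, remainder, ovflo).  When ovflo is 1 the result
    registers are unpredictable; this sketch returns None for them.
    """
    dividend = (dividend_hi << 64) | dividend_lo
    if divisor == 0 or dividend // divisor > MASK64:
        return None, None, 1          # quotient does not fit in 64 bits
    return dividend // divisor, dividend % divisor, 0
```

The quotient fits exactly when the high half of the dividend is smaller than the divisor.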
compare R:dst, R:src1, R:src2
compare_imm R:dst, R:src1, I12:immed
Register src1 is compared to register src2 (or, for compare_imm, to the immediate immed). The results of various comparisons are placed in certain bits of register dst, as follows. (Bits are numbered from the least significant end.) Bits not specified in the accompanying table have undefined values.
bit condition mnemonic
9 : (unsigned) src1 >= src2 uge
8 : (unsigned) src1 < src2 ult
7 : (unsigned) src1 <= src2 ule
6 : (unsigned) src1 > src2 ugt
5 : (signed) src1 >= src2 sge
4 : (signed) src1 < src2 slt
3 : (signed) src1 <= src2 sle
2 : (signed) src1 > src2 sgt
1 : src1 /= src2 ne
0 : src1 = src2 eq
The result is designed for branching with branch_bit_0 and
branch_bit_1.
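The result word can be modeled directly from the table. In this sketch the undefined bits are simply left at zero, which is one permitted outcome rather than specified behavior; the helper names to_signed and branch_bit_1 (as a predicate) are ours.

```python
MASK64 = (1 << 64) - 1


def to_signed(x):
    """Interpret a 64-bit pattern as a twos-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def compare(src1, src2):
    """Return the ten defined result bits (bit 0 = eq ... bit 9 = uge)."""
    u1, u2 = src1 & MASK64, src2 & MASK64
    s1, s2 = to_signed(u1), to_signed(u2)
    conditions = [
        u1 == u2,   # bit 0: eq
        u1 != u2,   # bit 1: ne
        s1 > s2,    # bit 2: sgt
        s1 <= s2,   # bit 3: sle
        s1 < s2,    # bit 4: slt
        s1 >= s2,   # bit 5: sge
        u1 > u2,    # bit 6: ugt
        u1 <= u2,   # bit 7: ule
        u1 < u2,    # bit 8: ult
        u1 >= u2,   # bit 9: uge
    ]
    return sum(int(c) << bit for bit, c in enumerate(conditions))


def branch_bit_1(bit, src):
    """True when the given bit of src is one (the branch would be taken)."""
    return ((src >> bit) & 1) == 1
```

A signed "less than" branch, for example, is a compare followed by branch_bit_1 on bit 4.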
decode_ieee32 RR:dst, R:src
decode_ieee64 RR:dst, R:src
These instructions convert a floating-point value in single- or double-precision IEEE 754 format to a floating-point value in the format of section 3. A double-precision IEEE value occupies the entire register src. A single-precision IEEE value occupies the lower half of register src. The internal-format value occupies register pair dst.
encode_ieee32 R:dst, RR:src
encode_ieee64 R:dst, RR:src
These instructions convert a floating-point value in the format of section 3 to a single- or double-precision IEEE 754 format. A double-precision IEEE value occupies the entire register dst. A single-precision IEEE value occupies the lower half of register dst, and the upper half is cleared. The internal-format value occupies register pair src.
add_float RR:dst, RR:src1, RR:src2
subtract_float RR:dst, RR:src1, RR:src2
multiply_float RR:dst, RR:src1, RR:src2
divide_float RR:dst, RR:src1, RR:src2
These instructions perform the identified floating-point dyadic arithmetic operations on values in internal format in register pairs src1 and src2, producing a result in register pair dst. IEEE 754 semantics govern, except that no rounding is performed.
round_float RR:val, R:info, D12:disp
If the floating-point value in internal format in register pair val has class "normal", it is rounded and placed back in the register pair. Rounding is performed according to the specifications in info, whose meaning is as follows. (See also Figure 6.)
bits name meaning
0-7 -- unused (should be zero)
8-9 mode rounding mode: 0 toward nearest
1 toward zero
2 toward -infinity
3 toward +infinity
10-15 length number of bits of coefficient to retain
(0 means to retain all 64 bits)
16-39 expmin minimum exponent for normal class
40-63 expmax maximum exponent for normal class
If the value has class "infinite" or "NaN", or if the value has class
"normal" and the exponent is outside the specified exponent range, a
branch is made according to disp.
load_register R:dst, R:index, I6:disp
The contents of the register whose address is the 6-bit modulus sum of disp and the contents of index are copied to register dst.
store_register R:src, R:index, I6:disp
The contents of register src are copied to the register whose address is the 6-bit modulus sum of disp and the contents of index.
load_direct_32 R:dst, I18:disp
load_direct_64 R:dst, I18:disp
load_indexed_32 R:dst, R:index, I12:disp
load_indexed_64 R:dst, R:index, I12:disp
store_direct_32 R:src, I18:disp
store_direct_64 R:src, I18:disp
store_indexed_32 R:src, R:index, I12:disp
store_indexed_64 R:src, R:index, I12:disp
host_load R:contents, R:address, I6:length, I6:disp
host_store R:contents, R:address, I6:length, I6:disp
These instructions load or store length bits (where 0 means 64) at the bit address in host memory equal to the (64-bit modulus) sum of address and unsigned disp.
The host memory interface is designed to allow for flexible mapping of emulated programs onto the host's presumed power-of-two-sized memory. Where emulated memory is power-of-two-sized (as for the IBM 360, for example), the mapping is direct and simple. In other cases, as for a 36-bit machine such as the IBM 704, two distinct approaches can be used.
Note that the general model embodied here in these host access instructions does not place any necessary burden on a host. The emulator hardware will generate a sequence of host memory accesses--appropriate to the host and wrapped with appropriate shifting--to accomplish the semantics described here. In particular, this semantic model can be implemented even for hosts that prohibit unaligned access or provide fewer than 64 bits for each access.
decode_address R:dst, R:src, I6:upw, I6:ulen
The value in register src is divided by unsigned upw (units per 64-bit word). (A zero value for upw means 64.) The 58 low bits of quotient are concatenated to the 6-bit product of the remainder and unsigned ulen (the unit length) and the result is placed in register dst. (A zero value for ulen means 64.) This result is the proper bit address for the designated emulated memory unit in host memory. The number of low-order bits of each 64 bits of host memory that are "wasted" is 64 - upw * ulen. It must be that upw * ulen <= 64.
encode_address R:dst, R:src, I6:upw, I6:ulen
The product of unsigned upw and the high-order 58 bits of register src are added to the quotient of the low-order 6 bits of src and unsigned ulen and placed in register dst.
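The two address conversions are inverses of each other, as the following sketch illustrates. It assumes that a zero upw or ulen means 64 for encode_address as well as for decode_address; the text states this only for decode_address.

```python
MASK64 = (1 << 64) - 1


def _or64(value):
    """A zero in a 6-bit upw or ulen field stands for 64."""
    return value or 64


def decode_address(src, upw, ulen):
    """Map an emulated unit address to a host bit address."""
    upw, ulen = _or64(upw), _or64(ulen)
    assert upw * ulen <= 64, "units must fit in one 64-bit host word"
    quotient, remainder = divmod(src, upw)
    return ((quotient << 6) & MASK64) | (remainder * ulen)


def encode_address(src, upw, ulen):
    """Map a host bit address back to an emulated unit address."""
    upw, ulen = _or64(upw), _or64(ulen)
    return ((src >> 6) * upw + (src & 0x3F) // ulen) & MASK64
```

With three 18-bit units per host word, for instance, unit 7 is the second unit of host word 2, so it decodes to bit address 2 * 64 + 18 = 146, and encoding 146 gives back 7.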
increment_host_address R:dst, R:src, I6:incr, I6:limit
Unsigned incr is added to the low-order 6 bits of the host (bit) address in register src. (A zero value for incr means 64.) If the result equals or exceeds unsigned limit, the low 6 bits are set to zero and the high-order 58 bits are incremented.
decrement_host_address R:dst, R:src, I6:incr, I6:limit
Unsigned incr is subtracted from the low-order 6 bits of the host (bit) address in register src. (A zero value for incr means 64.) If the result is negative, the low 6 bits are set to unsigned limit - incr and the high-order 58 bits are decremented.
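A sketch of the two stepping instructions follows. It takes limit at face value (no special meaning for zero, since none is stated) and assumes incr does not exceed limit.

```python
MASK64 = (1 << 64) - 1


def increment_host_address(src, incr, limit):
    """Step a host bit address forward by incr bits within the current word."""
    incr = incr or 64                      # a zero incr means 64
    word, bit = src >> 6, (src & 0x3F) + incr
    if bit >= limit:                       # ran off the used part of the word
        word, bit = word + 1, 0
    return ((word << 6) | bit) & MASK64


def decrement_host_address(src, incr, limit):
    """Step a host bit address backward by incr bits within the current word."""
    incr = incr or 64
    word, bit = src >> 6, (src & 0x3F) - incr
    if bit < 0:                            # ran off the front of the word
        word, bit = word - 1, limit - incr
    return ((word << 6) | bit) & MASK64
```

For example, stepping through 6-bit characters stored in 36-bit words (incr 6, limit 36) moves from bit 30 of one host word to bit 0 of the next, skipping the 28 unused bits.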
The emulator operates in Big Endian mode. However, in order to ease emulation of Little Endian architectures (albeit at some performance cost), the following instructions are provided.
swap_bytes_2 R:dst, R:src
swap_bytes_4 R:dst, R:src
swap_bytes_8 R:dst, R:src
These instructions swap bytes to account for Little Endian addressing. swap_bytes_2 is for PDP11-style addressing, for data of size 16, 32, and 64. swap_bytes_4 is for 4-byte VAX quantities. swap_bytes_8 is for 8-byte VAX quantities.
branch_cond I6:cond, R:src, D12:disp12
Branch if the condition described by cond is met for register src. The condition is met if any of the following is true:
cond bit n is 1 and the value of src is
------------------------------------------------
3 maximum negative
2 less than zero but not maximum negative
1 equal to zero
0 greater than zero
The following mnemonics are assigned to the indicated combinations of
the above conditions.
3 2 1 0 mnemonic
0 0 0 0
0 0 0 1 gt0
0 0 1 0 eq0
0 0 1 1 ge0
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0 lt0
1 1 0 1 ne0
1 1 1 0 le0
1 1 1 1
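The condition test above can be sketched as a classification of src into one of four classes, followed by a lookup of the matching cond bit. The function names are ours; the constants are the mnemonics from the table.

```python
MASK64 = (1 << 64) - 1
SIGN = 1 << 63

# Mnemonics from the table above, as 4-bit cond values (bit 3 ... bit 0).
GT0, EQ0, GE0 = 0b0001, 0b0010, 0b0011
LT0, NE0, LE0 = 0b1100, 0b1101, 0b1110


def condition_class(src):
    """Return the cond bit number (0 to 3) that describes the value of src."""
    src &= MASK64
    negative = bool(src & SIGN)
    others_zero = (src & ~SIGN & MASK64) == 0
    if negative:
        return 3 if others_zero else 2     # maximum negative / other negative
    return 1 if others_zero else 0         # zero / greater than zero


def branch_cond(cond, src):
    """True when the branch would be taken."""
    return bool((cond >> condition_class(src)) & 1)
```

Only two facts about src are needed: its sign bit and whether all the other bits are zero.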
branch_bit_0 I6:bit, R:src, D12:disp12
branch_bit_1 I6:bit, R:src, D12:disp12
Branch if bit bit of register src is zero/one. This instruction is especially useful with the result of compare.
branch_direct R:link, I18:disp
branch_indexed R:link, R:index, I12:disp
Branch unconditionally and place the address of the following instruction into register link. link usually specifies register 0 when the return address is not needed.
dispatch I6:log, R:src, W:width, O:offset
Branch unconditionally to the address which is the (18-bit modulus) sum of the address of the next instruction and the result of shifting the designated field of register src left unsigned log bits. log must be less than 8.
That is, a dispatch instruction is followed, in line, by a table of instructions. There are 2^width entries to the table, corresponding to the 2^width possible values for the specified field. These entries occur every 2^log instructions, starting with the first instruction following the dispatch.
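The target computation can be sketched as follows. The function name dispatch_target is hypothetical, and the sketch assumes that the field starts offset bits from the least significant end of src.

```python
ADDR_MASK = (1 << 18) - 1   # emulator addresses are 18 bits wide


def dispatch_target(next_addr, src, width, offset, log):
    """Address reached by dispatch: next instruction + (field << log).

    ASSUMPTION: the field is `width` bits wide and starts `offset` bits from
    the least significant end of src.
    """
    if not 0 <= log < 8:
        raise ValueError("log must be less than 8")
    field = (src >> offset) & ((1 << width) - 1)
    return (next_addr + (field << log)) & ADDR_MASK
```

With log equal to 1, each table entry is two instructions long, so field value 0xA5 lands 0x14A instructions past the start of the table.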
bump_branch_<rel> R:index, R:addend, R:limit, D6:disp
bump_branch_<rel>_imm R:index, I6:addend, I6:limit, D6:disp
Augment register index by signed addend. Branch if the value of register index before its augmentation is in the designated relation to signed limit. <rel> can be lt, le, gt, ge, eq, or ne.
branch_on_signed_overflow I6:len, R:src, D12:disp
branch_on_unsigned_overflow I6:len, R:src, D12:disp
Branch if the value in register src is outside the range of len-bit signed/unsigned values. (len itself is unsigned.) A zero value for len means 64.
When len is 0 (meaning 64), the result depends on the value of ovflo or carry, for signed and unsigned cases, respectively. Otherwise, the content of src determines the result. For the signed case, a branch results if any bit to the left of the most significant bit of the len-bit value differs from that bit (the sign bit). For the unsigned case, a branch results if any bit to the left of the most significant bit of the len-bit value is nonzero.
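The range tests can be sketched as follows, under our reading that src "fits" in len signed bits when bit len-1 and every bit above it agree. The ovflo and carry bits are passed in as arguments because the sketch has no machine state of its own.

```python
MASK64 = (1 << 64) - 1


def signed_overflow(length, src, ovflo=0):
    """True when src does not fit in `length` signed bits (0 means 64)."""
    if length == 0:
        return bool(ovflo)                  # 64-bit case: defer to ovflo
    upper = (src & MASK64) >> (length - 1)  # the sign bit and all bits above
    all_ones = (1 << (65 - length)) - 1
    return upper not in (0, all_ones)


def unsigned_overflow(length, src, carry=0):
    """True when src does not fit in `length` unsigned bits (0 means 64)."""
    if length == 0:
        return bool(carry)                  # 64-bit case: defer to carry
    return ((src & MASK64) >> length) != 0
```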
return_from_exception R:addr
Branch unconditionally to the address in addr and turn exception mode off.
signal_host R:request
Request the service identified by request from the host computer. See section 8, "Supervision", for descriptions of the defined requests.
Because the exception mechanism serves most importantly as an interruption mechanism by which the host can communicate with the emulator, full description of the exception mechanism is deferred to the next section.
The following signals are defined.
During emulated program execution, I/O requests recognized during emulated instruction execution (such as an RD instruction on the 650) are satisfied by requests sent to the host with signal_host instructions. Upon completion of such a service request, the host computer raises a service_completion_exception in the emulator. See the treatment of the RD and PCH instructions in Appendix 4 for examples. The specifics of such interactions depend on the computer being emulated and the host computer.
When the emulated program has terminated normally, the emulator signals this to the host computer with a service request that indicates emulated program termination.
The emulator includes two registers not part of the programming model, an instruction counter and an instruction limit. These registers are controlled by the host program to provide a rudimentary debugging facility for emulator programs. The instruction counter simply advances by one for each emulator instruction executed. When the instruction counter matches the instruction limit, an exception is raised, and the emulator's exception handler saves the emulator state (in memory visible to the host), signals the host, and waits for further instruction. The host can examine the emulator's state, modify it (including the value of the instruction limit register), and resume execution.
Typical sizes for working stores were examined during the initial investigation. (See section 1, "Application Characteristics".) It was concluded there that most architectures have fewer than about 200 bytes of working storage. Assuming 64-bit registers, register sets of various sizes yield these amounts of working storage:
number of    working storage
registers    in bytes
32           256
64           512
128          1024
256          2048
So it is seen that 32 registers are barely enough, while 256 registers
appear to be plenty.
The actual number chosen, 64, resulted from secondary issues that are
addressed next.
Adopting that model has two strong influences on instruction set design. First, in order to maintain the separation of result register from operand registers, the field manipulation instructions must (at least sometimes) specify four operands! Fortunately, two of these operands are short: of width equal to the ceiling of the log of the number of bits in a register. Of course, with a power-of-two number of registers, the ceiling function makes no difference, and instruction bandwidth is not wasted. Just as importantly, it becomes clear that it is quite desirable that the number of registers be equal to the number of bits in a register, so that the same field of an instruction can comfortably specify either a register address or a bit position within a register.
In the MC88000, this common number was 32. For the emulator, we have already seen that 32 registers would barely be enough. Considering greater numbers, we see that with 64 registers, four operands would fit in a 32-bit instruction, leaving 8 bits for opcode. This allocation seems rather serendipitous, but we'd still prefer more registers. Could 128 registers be made to work? Unfortunately not, as four 7-bit operand specifications would leave a paltry 4 bits for opcode information. So, 64 64-bit registers seems quite an acceptable balance point.
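The bit budget in this argument is easy to tabulate. The sketch below assumes, as the text does, a 32-bit instruction word and four operand fields; the function name opcode_bits is ours.

```python
def opcode_bits(num_registers, word_bits=32, operands=4):
    """Bits left for the opcode after `operands` register-address fields."""
    field_bits = (num_registers - 1).bit_length()   # ceiling of log2
    return word_bits - operands * field_bits


if __name__ == "__main__":
    for n in (32, 64, 128):
        print(f"{n:4d} registers -> {opcode_bits(n)} opcode bits")
```

This gives 12 opcode bits for 32 registers, 8 for 64, and 4 for 128.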
The final step in shaping the instruction format is to order the fields simply left to right and realize that not all instructions need to specify four operands. It is quite natural to let the final needed field "soak up" the remainder of the instruction word's 32 bits, especially when that field serves as an immediate value, typically a shorthand for a more general specification using registers.
We choose not to provide conversion instructions among fixed-point formats, because the conversions are simple using already existing instructions (see Appendix 1), and because we don't feel that the frequency of ones-complement and signed-magnitude formats justifies additional instructional support.
Many machines provide multiple-precision fixed-point arithmetic. In addition, there is some chance that we will want to emulate a machine that manipulates fixed-point values larger than 64 bits wide. For both of these reasons, we provide multiple-precision support. (See add_with_carry_out, subtract_with_carry_out, add_with_carry_in_and_out, and subtract_with_carry_in_and_out under "Fixed-Point Arithmetic Operations" in section 6.) Also, multiplications produce a double-length product.
Given this situation, one approach would be simply to adopt the IEEE format wholesale. The convergence to the IEEE standard among current architectures would mean that, over time, the emulator's format would be a perfect fit for more and more possible emulation targets. Moreover, in emulating a machine with a different format, one could still use IEEE-format facilities if one were willing to accept less-than-perfect emulation with respect to floating-point precision.
While this approach is reasonable, we prefer not to penalize so sharply those emulation targets whose floating-point format is not IEEE-compliant. Instead, we define a new (and admittedly peculiar) internal floating-point format with two key properties:
It could be questioned whether trying to handle the complete set of word sizes (from 6 to 95) listed in section 1 is worthwhile. The primary features that facilitate handling various word sizes are the bit and field manipulation instructions. The main reason these instructions are defined so richly is to facilitate decoding of instructions. Once these instructions are present, there is little additional cost (beyond the host memory model mentioned above) to provide graceful manipulation of various word sizes.
The condition specification is essentially "microcoded". Two independent conditions are evaluated: the state of the sign bit of the register and the "zero-ness" of all the other bits in the register. These two conditions give rise to four possibilities, corresponding to the four bits in the specification. I have always imagined that, by encoding the choices this way, the number of logic levels was minimized. This is important in a branch instruction because it is important to decide the direction of the branch as soon as possible. Note that the other conditional branch instructions (branch_bit_0 and branch_bit_1) have even simpler conditions. Put another way, when two comparands are to be compared, the comparison is separated from the branch, ensuring plenty of time to decide the branch direction. Thus, manifesting this low-level specification of the branch condition allows one to keep the cycle time minimal.
With the same goal of reducing cycle time in mind, I chose to specify that the comparison in the bump_branch_... instructions be performed on the value of the index register before its augmentation. This removes the time for an addition from the critical path. Whether this reduction would actually matter is not known. Even after this simplification, the bump_branch_... instructions are significantly more complex than branch_cond, for example.
We show an outline for an emulator for the IBM 360 in Appendix 3. The sketch shows the basic instruction fetch cycle, complete with program error and interrupt checking, including an instruction for each of the RR and RX formats.
In Appendix 4 we give a much more extended sketch of an emulator for the IBM 650. Whereas the IBM 360 matches the emulator in several key respects (including powers-of-two sizes and twos-complement integers), the IBM 650 differs much more markedly, being fully decimal with a word size of 11 decimal digits. The sketch for the 650 also shows much more explicitly how the emulator and host machines interact.