The 8087 was designed by a numerical analysis expert and served as working proof for the IEEE 754 floating-point spec. It was revolutionary. It was released in 1980; the spec was ratified in 1985.
Before the 8087, floating-point math used proprietary formats (limiting interoperability), lacked accuracy and consistency (rounding and precision) and was mostly emulated (so ultra-slow).
The 8087 supports IEEE 754 single precision (32-bit) and double precision (64-bit) formats. Internally, it uses a stack-based 8-deep set of 80-bit FP registers. Each FP register has 1 sign bit, 15-bit exponent and 64-bit mantissa (no implicit bit).
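To make the 80-bit register layout concrete, here is a minimal Python sketch that decodes a 10-byte extended-precision value into an exact fraction. The function name is my own, and the sketch only handles finite values (no infinity/NaN); it is an illustration of the bit layout, not a full decoder.

```python
from fractions import Fraction

def decode_x87_extended(raw: bytes) -> Fraction:
    """Decode a 10-byte little-endian 80-bit extended value (finite values only)."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    value = int.from_bytes(raw, "little")
    mantissa = value & ((1 << 64) - 1)   # 64 bits; the integer bit (bit 63) is explicit
    exponent = (value >> 64) & 0x7FFF    # 15-bit biased exponent
    sign = -1 if (value >> 79) else 1    # 1 sign bit
    if exponent == 0x7FFF:
        raise ValueError("infinity/NaN not handled in this sketch")
    # Bias is 16383 and the mantissa is a 1.63 fixed-point number.
    # Exponent 0 (denormals) uses the same scale as exponent 1.
    scale = max(exponent, 1) - 16383 - 63
    return sign * mantissa * Fraction(2) ** scale

one = decode_x87_extended(bytes.fromhex("0000000000000080ff3f"))
minus_two_and_a_half = decode_x87_extended(bytes.fromhex("00000000000000a000c0"))
```

Note how 1.0 is stored with mantissa 0x8000000000000000: the leading 1 is really there, unlike in the 32-bit and 64-bit formats.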
Most 8087 instructions operate on ToS (Top-of-Stack). Programmers were used to operands and were unfamiliar with stack-based operations. It was a struggle to write efficient code.
The biggest issue with the 8087 is its buggy stack architecture. Due to misalignment between the design and hardware teams, the hardware does not automatically spill an overflowed stack to a memory-based virtual stack. Overflow is handled as an exception, which is complex and slow. Software works around it by not overflowing the stack in the first place.
Because of this, results can be unpredictable and inconsistent, depending on whether the calculations are done entirely in 80-bit registers or spilled midway into 64-bit FP in memory (with less precision). Which one you get depends on the compiler and optimization level.
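A rough way to see the effect is to simulate rounding to a 64-bit significand (register) versus a 53-bit significand (64-bit double in memory). The Python sketch below uses exact fractions and round-half-to-even; the helper names and the chosen values are mine, and exponent range is ignored.

```python
from fractions import Fraction

def round_to_significand(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant binary digits (round-half-to-even)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    x = abs(x)
    # Find e such that 2**(bits-1) <= x / 2**e < 2**bits.
    e = x.numerator.bit_length() - x.denominator.bit_length() - bits
    while x / Fraction(2) ** e >= 2 ** bits:
        e += 1
    while x / Fraction(2) ** e < 2 ** (bits - 1):
        e -= 1
    scaled = x / Fraction(2) ** e
    return sign * round(scaled) * Fraction(2) ** e  # round() on Fraction is half-to-even

def add_then_sub(a, b, bits, spill_bits=None):
    """Compute (a + b) - a, optionally spilling the intermediate to a narrower format."""
    t = round_to_significand(a + b, bits)
    if spill_bits is not None:
        t = round_to_significand(t, spill_bits)  # store to memory and reload
    return round_to_significand(t - a, bits)

a, b = Fraction(2) ** 53, Fraction(1)
kept_in_register = add_then_sub(a, b, 64)               # intermediate stays 80-bit
spilled_to_memory = add_then_sub(a, b, 64, spill_bits=53)  # intermediate spilled as double
```

Same source expression, two different answers: the 1 survives in the 64-bit significand but is rounded away when the intermediate goes through a 64-bit double.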
At one point, it was thought the reliance on ST(0) made it impossible to pipeline FP operations — because they all used ST(0). But Pentium proved it was possible to do register renaming with FXCH and achieved pipelined FP operations. It was a breakthrough. From that point, x86 FP became competitive in speed with RISC CPUs.
The second issue is that explicit synchronization is needed using FWAIT. This is needed 99% of the time, so assemblers insert FWAIT automatically before each FP instruction. This is not needed from the 80287 onwards, as the CPU waits for the FPU automatically.
8087 runs in parallel to 8086. It is slow compared to integer operations, e.g. FADD takes 70 – 100 cycles, so it is possible to run many x86 instructions before executing the next FP operation. Question is, how many programs made use of this?
The third is emulation. Unlike the 80286, the 8086 does not raise a Coprocessor Absent exception that would have allowed transparent software emulation. This means the executable cannot just contain FP instructions and rely on a trap. It must use emulator-transformed code that calls emulated functions if the 8087 is absent, and that is modified back to 8087 code if it is present. This is an excellent use of self-modifying code.
The technique is pretty clever. The compiler emits actual 8087 code and marks each instruction as requiring a fixup (relocation). The linker transforms them into emulator calls by applying the fixup (which is an addition) if emulator support is needed.
(This technique does require FWAIT before each FP instruction to patch properly.)
Side note. 8087 also supports 64-bit integer and 18-digit BCD operations. These are obsolete today.
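For the curious, the 18-digit BCD format is 10 bytes: nine bytes holding two decimal digits each (least significant byte first) and a final byte whose top bit is the sign. A small Python sketch of the packing, with function names of my own choosing:

```python
def encode_packed_bcd(n: int) -> bytes:
    """Pack an integer into the 10-byte, 18-digit packed BCD layout."""
    if abs(n) >= 10 ** 18:
        raise ValueError("packed BCD holds at most 18 decimal digits")
    digits = abs(n)
    out = bytearray(10)
    for i in range(9):                      # bytes 0..8: two digits per byte
        digits, pair = divmod(digits, 100)
        out[i] = ((pair // 10) << 4) | (pair % 10)
    out[9] = 0x80 if n < 0 else 0x00        # byte 9: sign in the top bit
    return bytes(out)

def decode_packed_bcd(raw: bytes) -> int:
    """Inverse of encode_packed_bcd."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    value = 0
    for byte in reversed(raw[:9]):          # most significant digit pair first
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return -value if raw[9] & 0x80 else value
```

For example, `encode_packed_bcd(1234)` starts with the bytes 0x34, 0x12: the digits read naturally in hex, which is the whole appeal of BCD.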
Despite a different housing, it is unmistakably the same bino as VisionKing. They even have the same pouch! VK has "VisionKing" custom label on the pouch. This does not have custom label.
(VK has custom labelled shipping box and packing tape. Now, that's professional. I didn't expect this for a low-end bino.)
Brightness is fine. No noticeable dimness in day time. Same brightness as VK.
Zone of focus in the middle only, as expected. It is sharper than VK and has crispy snap-to-focus.
All lens surfaces seem coated, unsure if prism is coated — can see one white and some cream/yellow (single coated) reflection like VK. No phase coating.
Bino focusing knob claims 15° and FMC.
Twist-up eyecups have click-stops.
Measured weight is 577 g w/ front caps and w/o strap. This is heavier than all my other binos!
Claims min focus distance is 1.9 m, actual is ~3 m. This is worse than VK!
I wonder if there are two variants depending on the batch. The min focus distance is either 1.9 m or 3 m. Safari Uni got the 1.9 m one in their first batch, so they advertised as such. VK was advertised with 3 m, but mine was ~2 m.
But it's fine, no one uses this bino for macro viewing. :lol: (Note that this is a valid use-case for binos.)
Brightness is fine. No noticeable dimness in day time. Looking at the sky through the objective lens at an arm's length, it is as bright as Svbony SV202. At night, seems brighter than SV202! This is possible, 5 mm exit pupil (25 / 5) vs 4 mm (32 / 8).
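The exit pupil arithmetic is just objective diameter divided by magnification. A tiny Python sketch (function names are mine); the brightness ratio assumes the eye's pupil is at least as wide as the larger exit pupil, which is a simplification that only holds in the dark.

```python
def exit_pupil_mm(objective_mm: float, magnification: float) -> float:
    """Exit pupil diameter = objective diameter / magnification."""
    return objective_mm / magnification

def relative_brightness(exit_pupil_a: float, exit_pupil_b: float) -> float:
    """Ratio of exit pupil areas (assumes the eye's pupil is not the limit)."""
    return (exit_pupil_a / exit_pupil_b) ** 2

this_bino = exit_pupil_mm(25, 5)   # 5x25
sv202 = exit_pupil_mm(32, 8)       # 8x32
ratio = relative_brightness(this_bino, sv202)
```

So on paper the 5 mm exit pupil passes about 1.56x the light of the 4 mm one at night, which matches what I saw.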
Zone of focus in the middle only, as expected. It is not super sharp, but is acceptable. A bit hard to achieve optimal focus due to 'mushy' focus — the focus zone is blurry.
All lens surfaces seem coated, even the objective glass. Prism is said to be uncoated. Torchlight test shows white (no coating) and cream/yellow (single coating) reflections among purple and green (multi-coating). No phase coating.
Bino focusing knob claims 15.8° and does not claim MC now. Box says FC.
Twist-up eyecups are friction-based, no clicks and a bit loose.
Measured weight is 540 g w/ front caps and w/o strap. This is heavier than most of my binos!
Claims min focus distance is 3 m, actual is ~2.1 m!
Short of making 8086 into a 24-bit CPU, we need to add instructions to manipulate 24-bit addresses. Taking cue from pointer arithmetic, not many are needed:
MOV, ADD, SUB and CMP are basic operations.
ADC and SBB can be used to construct >24-bit pointers, but I don't know if they will ever be used.
Bitwise AND and OR are useful.
LEA is of course essential. There is no 16-bit variant, only 24-bit.
PUSH.A pushes 24-bit value onto the stack (as two 16-bit values) and POP.A pops two 16-bit values.
MOV.H moves a value into the upper bits of another register and vice-versa. This allows more extensive, though lengthier, operations using standard 16-bit instructions. Technically, with MOV.H, we do not need any 24-bit specific instructions, but the operations will be lengthier.
Adding to a pointer:
    ; w/ 24-bit instructions
    add.a si, 4

    ; w/ 16-bit instructions
    mov.h ax, si
    add si, 4
    adc ax, 0
    mov.h si, ax
Calculating diff:
    ; Calculate si - di

    ; w/ 24-bit instructions
    mov.a ax, si
    sub.a ax, di
    ; ax (24-bit) contains the diff

    ; w/ 16-bit instructions
    mov ax, si
    mov.h dx, si
    mov.h cx, di
    sub ax, di
    sbb dx, cx
    ; dx:ax contains the diff
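To convince myself the 16-bit sequences really match the 24-bit instructions, here is a small Python model of the pointer add. It assumes MOV.H copies between the upper 8 bits of a 24-bit register and the low bits of a 16-bit one (my reading of the description), and the function names are made up.

```python
MASK16 = 0xFFFF
MASK24 = 0xFFFFFF

def add_a(ptr: int, delta: int) -> int:
    """Model of the hypothetical ADD.A: a single 24-bit add."""
    return (ptr + delta) & MASK24

def add_16bit_sequence(ptr: int, delta: int) -> int:
    """Model of MOV.H / ADD / ADC / MOV.H for a delta that fits in 16 bits."""
    high = ptr >> 16                 # mov.h ax, si  (upper 8 bits of SI)
    low = (ptr & MASK16) + delta     # add si, delta
    carry = low >> 16                # carry flag out of the 16-bit add
    low &= MASK16
    high = (high + carry) & 0xFF     # adc ax, 0     (only 8 upper bits exist)
    return (high << 16) | low        # mov.h si, ax

crossing = add_16bit_sequence(0x01FFFE, 4)   # crosses a 64 kB boundary
```

Both routes give 0x020002 for the boundary-crossing case; the 16-bit route just takes four instructions to do it.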
Remember I said SP should not be a GPR? It means we need SP specific instructions as well. Only a few are needed:
SP can be used directly to reference stack variables:
    push arg
    call proc
    ...

    proc:
        sub.a sp, N          ; reserve space for local vars
        mov ax, [sp + N + 4] ; arg
        mov bx, [sp + N - N] ; first local var
        mov cx, [sp + N - 2] ; last local var
        ...
        add.a sp, N          ; epilogue
        ret 2
Or via a frame pointer:
    push arg
    call proc
    ...

    proc:
        push.a fx
        mov.a fx, sp
        sub.a sp, N          ; reserve space for local vars
        mov ax, [fx + 8]     ; arg
        mov bx, [fx - N]     ; first local var
        mov cx, [fx - 2]     ; last local var
        ...
        mov.a sp, fx         ; epilogue
        pop.a fx
        ret 2
Any register that allows register indirect addressing can be used — memory is flat!
If we use a dedicated frame pointer, we end up with 7 GPRs. If we don't, then it is just as difficult to get stack trace and unwind the stack as no frame pointer — cos we don't know which is the frame pointer!
Unwinding the stack without frame pointer is a big issue on modern CPUs. Metadata is needed.
It is not common to have two explicit stacks, one solely for return address, the other for data, though modern CPUs use shadow stack or Return-Address Stack (RAS) to prevent Return Oriented Programming (ROP).
I don't see why not, as it seems trivial to support. Stack buffer overflow, whether unintentional or malicious, is a never-ending source of bugs. We need a separate SP (let's call it RSP), a couple of instructions, and a change to CALL/RET to use it.
New instructions needed:
These are privileged instructions. The Return Stack should be on special protected pages if MMU is present. There are no instructions to PUSH/POP nor manipulate the Return Stack. It is purely for CALL and RET.
To help to unwind the stack, we push SP onto the Return Stack too, so each CALL uses 8 bytes (4 for return address and 4 for SP).
CALL and RET must be paired. There is no need to restore SP because it is done automatically.
    push arg
    call proc
    ...

    proc:
        sub.a sp, N          ; reserve space for local vars
        mov ax, [sp + N]     ; arg
        mov bx, [sp + N - N] ; first local var
        mov cx, [sp + N - 2] ; last local var
        ...
        ret 2                ; no epilogue needed, will restore SP
One downside is that we can no longer RET to an arbitrary address — but this is the whole point!
    push.a ax ; no longer allowed
    ret
Pushing four 16-bit words on each CALL on a 16-bit CPU is inefficient. Even functions that do not use the stack pay this penalty.
Since the address space is 24-bit, we use the upper 8 bits of the return address entry to store the stack size in words. With 255 reserved as an escape value, this allows up to 508 bytes (254 words) of local variables. If more is needed, we can use 255 to mean the lower 24 bits are SP; we push two additional words in this case.
Thus, we push only two 16-bit words on the Return Stack for each CALL in most cases. If the function uses ENTER, it modifies the top 8 bits of the return address or pushes two additional 16-bit words.
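Here is a toy Python model of that packing, mostly to check the bookkeeping. The exact layout of the escape case is not pinned down above, so this sketch assumes the saved SP goes in its own 32-bit entry underneath the return address; the function names are also mine. Like the scheme itself, it ignores PUSH/POP inside the function.

```python
ESCAPE = 0xFF
ADDR_MASK = 0xFFFFFF

def call(return_stack: list, return_addr: int) -> None:
    """CALL: push one 32-bit entry, frame size 0 in the top 8 bits."""
    return_stack.append(return_addr & ADDR_MASK)

def enter(return_stack: list, sp: int, frame_bytes: int) -> int:
    """ENTER: record the frame size (or the saved SP) and return the new SP."""
    if frame_bytes < 0 or frame_bytes % 2:
        raise ValueError("frame size must be a non-negative even byte count")
    words = frame_bytes // 2
    entry = return_stack.pop()
    if words < ESCAPE:
        return_stack.append((words << 24) | entry)    # size fits in the top 8 bits
    else:
        return_stack.append(sp & ADDR_MASK)           # assumed layout: extra entry with SP
        return_stack.append((ESCAPE << 24) | entry)
    return (sp - frame_bytes) & ADDR_MASK

def ret(return_stack: list, sp: int):
    """RET: pop the return address and restore SP without an epilogue."""
    entry = return_stack.pop()
    words = entry >> 24
    if words == ESCAPE:
        sp = return_stack.pop()
    else:
        sp = (sp + 2 * words) & ADDR_MASK
    return entry & ADDR_MASK, sp

rs = []
call(rs, 0x001234)
sp = enter(rs, 0x00F000, 16)    # small frame: one entry on the Return Stack
small_entry = rs[-1]
result = ret(rs, sp)
```

A 16-byte frame ends up as the single entry 0x08001234, and RET gets back both the return address and the original SP.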
Revamped code:
    push arg
    call proc

    proc:
        enter N              ; reserve space for local vars, updates Return Stack
        mov ax, [sp + N]     ; arg
        mov bx, [sp + N - N] ; first local var
        mov cx, [sp + N - 2] ; last local var
        ...
        ret 2                ; no epilogue needed, will restore SP
Unfortunately, this scheme does not work if we PUSH onto the stack. It is possible to make it work. This is left as an exercise for the reader. :-P
In the future, for 32-bit CPU, we will always push two 32-bit words (Return Address and SP) on the Return Stack.
The first thing I'm going to get rid of is the 8086's segmented memory model — its defining characteristic!
The segmented memory model worked well in the 60s and 70s, simplifying code/data relocation and providing a cheap way to protect processes in a multi-process environment.
It works well as long as your data fits within a segment. Once you exceed it, it is painful.
With hindsight, we can see segmentation falling out of favour, with paging becoming the memory management scheme of choice.
8086 also has a bigger address space (20-bit) than its register size (16-bit), so it is difficult to address the entire space.
80286 uses segment selectors instead of physical segments in Protected Mode. The segment registers index into a Descriptor Table that contains the base of the segment, among others.
If the upper bits of a logical address went directly into the lower bits of the selector, we could access >64 kB almost seamlessly.
But Intel put 3 control bits at the bottom, so complicated pointer arithmetic was needed again (need to +8 to increment to next selector).
    ; ideal sel:ofs
    0000:ffff + 1 -> 0001:0000

    ; 286
    0000:ffff + 1 -> 0008:0000
It is not 100% seamless — a single element cannot span segments, so >64 kB data has to be accessed carefully. Maybe this was why Intel purposely made selectors non-contiguous — you needed special handling code anyway.
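The selector layout is easy to model: the low 2 bits are the requested privilege level, the next bit selects GDT or LDT, and the descriptor index sits in the upper 13 bits. The Python sketch below shows why stepping to the next descriptor means adding 8. It assumes consecutive descriptors map consecutive 64 kB blocks, and the function names are my own.

```python
def make_selector(index: int, ti: int = 0, rpl: int = 0) -> int:
    """Build a 16-bit selector: index in bits 3..15, TI in bit 2, RPL in bits 0..1."""
    return ((index << 3) | (ti << 2) | rpl) & 0xFFFF

def split_selector(sel: int):
    """Return (index, ti, rpl)."""
    return sel >> 3, (sel >> 2) & 1, sel & 3

def increment_far_pointer(sel: int, ofs: int, delta: int):
    """Advance sel:ofs by delta bytes, stepping the selector by 8 per 64 kB crossed."""
    total = ofs + delta
    sel = (sel + 8 * (total >> 16)) & 0xFFFF
    return sel, total & 0xFFFF

stepped = increment_far_pointer(0x0000, 0xFFFF, 1)
```

The control bits ride along untouched because the carry out of the offset is added above them, which is exactly the extra work an "ideal" contiguous selector would have avoided.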
Anyway, this is water under the bridge.
The alternative to segmentation is flat memory. We need to either widen register size or use paired-registers.
Instead of 16-bit registers, we will have 20-bit registers. This allows us to put a full address in a register and dereference it directly. But this raises a question. How do we manipulate these 20-bit registers? Does it mean it is a 20-bit CPU now?
The other approach is to use paired registers. This is a common approach. 8-bit CPUs pair two 8-bit registers to access 16-bit memory space. But the 8086 has only 8 registers. Pairing them means we only have 4. Typically we need 2 – 3 pointers at the same time, so we only have 1 – 2 pairs left.
I'll go for the widened register approach. Instead of 20-bit, let's go for 24-bit — giving 16 MB address space. All registers are 24-bit. The CPU remains 16-bit. Most instructions manipulate 16-bit data, but some manipulate 24-bit — for pointer arithmetic.
With linear addressing, there are no more memory models. All pointers are FAR, indirect jumps/calls are FAR and all function returns are FAR. We are free from the 64 kB barrier.
The stack is also free of its 64 kB limit (though stacks seldom grow this big), but more importantly, any register can now reference the stack directly.
This does increase pointer size from 2 to 4 bytes. This makes the ISA unsuitable for systems with 64 kB or less since they only need 2-byte offsets. Once we get above a threshold, say 128 kB, the overhead no longer matters.
Another con is that we need a big relocation table. In the absence of paging and running each process in its own isolated memory space, all global code/data references need to be relocated.
A possibility is to use a Global Offset Table (GOT). This makes the code PIC (Position-Independent Code) and makes it reusable in multi-process environments.
The 8086 instruction set has very nice orthogonal instructions, but it also has a bunch of short instructions that use up a lot of valuable one-byte opcode space.
When Intel created the 32-bit 80386, there was really no reason to keep the same encodings — the object code needed to be regenerated anyway.
And when AMD defined the 64-bit instruction set (Intel was not interested at this point because they were creating 64-bit Itanium), they also missed the chance to redefine the encodings.
What if we could go back to the very beginning and define the instruction set properly without considering backwards compatibility with 8080/8085?
Key questions:
Note that this is a 16-bit processor. We need to keep in mind future extensions like floating-point math, 32-bit and 64-bit modes, and vector instructions.
x86 is CISC, so it will have variable-length instructions. The question is, do we want 1-byte instructions?
8086 was designed in the mid-70s and released in 1978. At that time, microcomputers had 4 kB of memory. The IBM PC shipped in 1981 with 16 kB, expandable to 64 kB on the motherboard. By 1985, it was common to fill up the entire 640 kB RAM space.
1-byte instructions are essential with 4 kB memory, not so much with 640 kB. Most x86 1-byte instructions are not high-occurrence instructions either.
Removing 1-byte instructions frees up valuable opcode space that can be used for defining shorter multi-byte instructions.
Load-store architecture is a RISC characteristic. CISC generally allows ALU operations directly on memory.
There are 3 kinds of mem ALU operations: reg-to-mem, mem-to-reg and mem-to-mem.
x86 supports the first two. It turns out that reg-to-mem (i.e. mem = mem op reg) is not good for future superscalar execution.
    ; load-store
    mov ax, [mem]
    add ax, bx
    mov [mem], ax

    ; mem-to-reg
    add bx, [mem]
    mov [mem], bx

    ; reg-to-mem
    add [mem], bx
RISC likes to have 32 GPRs, but that is overkill and hurts interrupt and context switch performance. 16 GPRs are generally sufficient, especially with register renaming. I think 8 is enough, but they must really be general-purpose!
The 8086 has 8 GPRs, but it really only has 6 — SP is not a GPR and BP is needed to access stack variables.
x86-64 uses REX prefix to expand the number of registers, among others. This is a very useful technique that can be added later.
8086 has very limited memory addressing modes. Only 4 registers can be used for register-indirect, and only in limited ways.
80386 revamps this with SIB (scale-index-base) which is super flexible — but it is not needed most of the time.
    ; x86
    [bx]
    [bx + disp]
    [bx + si]
    [bx + si + disp]

    ; SIB
    [bx * 2]
    [bx * 2 + disp]
    [si]
    [si + disp]
    [bx * 2 + si]
    [bx * 2 + si + disp]
Two are sufficient: register-indirect and register-indirect with offset.
With hindsight, it is very useful to have PC-relative addressing. This is used by x86-64, ARM and MIPS. It enables loading big literals (e.g. 64-bit) without making the instructions super long. It also enables PIC (Position-Independent Code).
Immediate operand increases instruction size. Do we support 1-byte and 2-byte immediates? Do we support 4-byte and 8-byte immediates in the future?
The 8086 has 1-byte and 2-byte immediates. The 80386 supports 1-byte and 4-byte immediates. (2-byte is supported via a size prefix, making it 3 bytes.)
Displacement addressing has one optional offset (1 – 2 bytes) and an immediate takes 1 – 2 bytes. Together they add up to 4 additional bytes to the instruction in 16-bit mode, but up to 8 additional bytes in 32-bit mode. That makes for very long instructions.
Example:
mov [mem], imm
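To put numbers on it, the Python sketch below adds up the bytes for this instruction when it is encoded as one opcode byte, one ModR/M byte, a full-width displacement and a full-width immediate (the x86 C7 /0 form with a direct address). The function name is mine and prefixes are ignored.

```python
def mov_mem_imm_length(address_bytes: int, operand_bytes: int) -> int:
    """Length of `mov [mem], imm` with a direct address: opcode + ModR/M + disp + imm."""
    opcode = 1
    modrm = 1
    return opcode + modrm + address_bytes + operand_bytes

len16 = mov_mem_imm_length(address_bytes=2, operand_bytes=2)   # 16-bit mode
len32 = mov_mem_imm_length(address_bytes=4, operand_bytes=4)   # 32-bit mode
```

That is 6 bytes in 16-bit mode and 10 bytes in 32-bit mode, of which only 2 are the instruction proper.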
Generally, 64-bit CPUs do not have 64-bit displacement nor immediate — they make the instruction too long and they are not often used.
As CISC, misaligned mem access is a given. However, there are times we want to enforce word-aligned access, for example, the stack, jump and call targets.
By strings, I mean the famous LODS, STOS, SCAS, MOVS and CMPS. STOS and MOVS are especially useful when paired with REP.
They are great for manipulating strings in memory constrained systems, but we are way past that.
First, REP can be replaced by a tight loop. Next, with hindsight, MOVS is the only remaining useful instruction. The others can be written using simple instructions. For example, LODS is:
    mov ax, [si]
    add si, 2
The 8086 has 16-bit registers and reaches a 20-bit address space through its segment registers. In the mid-70s, a 1 MB address space was unimaginable. But by the mid-80s, the IBM PC had already reached its limit (640 kB conventional memory).
We will widen the address space to 24-bit. 16 MB was pretty big even in the early 90s. Windows 95 required only 4 MB of memory, though it ran better with 8 MB.
Does this mean we shift the segment register by 8 bits? We want to get rid of segmentation...
It is difficult to do 20-bit seg:ofs pointer arithmetic and comparison.
Pointer addition (up to 65536 - 16):
    ; convert es:di to normalized pointer
    mov dx, es
    mov ax, di
    shr ax, 4
    add dx, ax
    mov es, dx
    and di, 0fh
    ; es:di is now normalized

    ; inc pointer by 4 (any value up to 65536 - 16)
    add di, 4
    ; es:di is incremented, but it is not normalized
Pointer addition (any value):
    ; es:di -> linear pointer in dx:ax
    mov ax, es
    mov dx, ax
    shl ax, 4
    shr dx, 12
    ; dx:ax now contains 20-bit linear seg addr
    add ax, di
    adc dx, 0

    ; add 32-bit ofs in cx:bx
    add ax, bx
    adc dx, cx

    ; dx:ax -> normalized pointer in es:di
    mov di, ax
    shr ax, 4       ; btm 12-bits of seg
    shl dx, 12      ; top 4-bits of seg
    or ax, dx       ; combine them
    mov es, ax
    and di, 0fh
    ; es:di is now normalized
Pointer subtraction:
    ; ds:si -> linear pointer in dx:ax
    mov ax, ds
    mov dx, ax
    shl ax, 4
    shr dx, 12
    ; dx:ax now contains 20-bit linear seg addr
    add ax, si
    adc dx, 0

    ; es:di -> linear pointer in cx:bx
    mov bx, es
    mov cx, bx
    shl bx, 4
    shr cx, 12
    ; cx:bx now contains 20-bit linear seg addr
    add bx, di
    adc cx, 0

    ; find the diff in dx:ax (ds:si - es:di)
    sub ax, bx
    sbb dx, cx
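The same conversions are much easier to follow in a high-level model. This Python sketch mirrors the assembly above: convert seg:ofs to a 20-bit linear address, do the arithmetic there, then normalize back so the offset is 0 – 15. Function names are mine, and addresses wrap at 1 MB as on the 8086.

```python
LINEAR_MASK = 0xFFFFF   # 20-bit physical address space

def to_linear(seg: int, ofs: int) -> int:
    """seg:ofs -> 20-bit linear address (segment * 16 + offset)."""
    return ((seg << 4) + ofs) & LINEAR_MASK

def normalize(seg: int, ofs: int):
    """Return the equivalent seg:ofs whose offset is in 0..15."""
    linear = to_linear(seg, ofs)
    return linear >> 4, linear & 0xF

def far_add(seg: int, ofs: int, delta: int):
    """Add any delta to a far pointer; the result is normalized."""
    linear = (to_linear(seg, ofs) + delta) & LINEAR_MASK
    return linear >> 4, linear & 0xF

def far_diff(p1, p2) -> int:
    """p1 - p2 in bytes, where each pointer is a (seg, ofs) tuple."""
    return to_linear(*p1) - to_linear(*p2)

p = far_add(0x1000, 0xFFFF, 1)
```

Normalization matters for comparison too: 1234:0010 and 1235:0000 name the same byte, which only becomes obvious after converting both to linear.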
It will help if the CPU can manipulate 20-bit pointers (24-bit with our design) directly. It will also help to have an extra segment register so that DS can point to the global data segment all the time.
Addition with CPU assistance:
    mov dx:ax, [p]
    ptr.p2l dx:ax   ; phy->linear addr
    add ax, bx
    adc dx, cx
    ptr.l2n dx:ax   ; linear->normalized
    les di, dx:ax   ; es:di -> dx:ax
Subtraction:
    mov dx:ax, [p1]
    ptr.p2l dx:ax
    mov cx:bx, [p2]
    ptr.p2l cx:bx
    sub ax, bx
    sbb dx, cx
Segmentation allows relocatable code and data without load-time fixup. It does not affect code much, but it is difficult to work with data bigger than 64 kB.
If we want linear addressing, we either need to widen the register size or use register-pairs for memory addressing.
Example of linear addressing:
    ; p1 and p2 are linear pointers
    ; *p2++ = *p1++
    mov ex:si, [p1]
    mov ax, [ex:si] ; paired-reg is automatically linear
    add si, 2
    adc ex, 0
    mov [p1], ex:si

    mov ex:si, [p2]
    mov [ex:si], ax
    add si, 2
    adc ex, 0
    mov [p2], ex:si
Pointer manipulation is a pain when address space (20 – 24 bits) > register size (16-bit). The problem goes away with 32-bit — both address space and register size match, and address space is big enough for most use-cases.
8086 uses a separate 64 kB port-mapped I/O address space. If we expand the memory size to 16 MB, we can just use memory-mapped I/O. With hindsight, everyone uses MMIO nowadays.
A popular pattern with port-mapped I/O is to select the index, then read/write the value. This is to reduce I/O ports used — the IBM PC has only 1,024 I/O addresses as it uses 10-bit I/O address on the 8-bit ISA bus (*). With memory-mapped I/O, we just access the I/O registers directly.
    ; Port I/O
    mov dx, 0x3d4   ; CGA CRTC index reg
    mov al, 0       ; 0 = Horizontal Total Register
    out dx, al
    inc dx          ; CGA CRTC data reg
    mov al, 0x38    ; value
    out dx, al

    ; Mem I/O
    mov ax, CGA_CRTC_REG_BASE
    mov es, ax
    mov es:[CGA_CRTC_HORZ_TOTAL_REG], 0x38
(*) IBM expanded the ISA bus to use 16-bit I/O address with IBM AT, but there were many I/O cards doing 10-bit decoding, so it was not safe to use higher address space — unless you reserved the range in the lower address space first. IBM should have put a compatibility jumper beside each slot that disables access if the higher address bits are non-zero.
Also, memory-mapped I/O gives the expectation that I/O registers can be read back. That expectation would be a nightmare with CGA, where many registers are write-only.
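A toy Python model ties the index/data pattern and the 10-bit decoding problem together. The class is hypothetical: it stands in for a card with an index port at 0x3D4 and a data port at 0x3D5 that looks at only the low 10 address bits, so any port with the same low bits hits the same registers.

```python
class IndexedCard:
    """Toy I/O card: index port 0x3D4, data port 0x3D5, 10-bit address decoding."""

    DECODE_MASK = 0x3FF   # only address bits 0..9 are looked at

    def __init__(self):
        self.index = 0
        self.regs = {}    # register number -> last value written

    def out(self, port: int, value: int) -> None:
        port &= self.DECODE_MASK          # upper address bits are ignored
        if port == 0x3D4:
            self.index = value            # select the register
        elif port == 0x3D5:
            self.regs[self.index] = value # write the selected register

card = IndexedCard()
card.out(0x3D4, 0)       # select register 0 (Horizontal Total)
card.out(0x3D5, 0x38)    # write its value
card.out(0x7D4, 1)       # 0x7D4 aliases to 0x3D4 on this card
card.out(0x7D5, 0x28)    # 0x7D5 aliases to 0x3D5
```

The last two writes were meant for some other device at 0x7D4, yet they land in this card's register 1. That is the hazard of using the upper I/O range alongside 10-bit cards.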