My Rambling Thoughts

Brief notes on 8087

The 8087 was designed by a numerical analysis expert and served as working proof for the IEEE 754 floating-point spec. It was revolutionary. It was released in 1980; the spec was ratified in 1985.

Before the 8087, floating-point math used proprietary formats (limiting interoperability), lacked accuracy and consistency (in rounding and precision) and was mostly emulated (so ultra-slow).

The 8087 supports IEEE 754 single precision (32-bit) and double precision (64-bit) formats. Internally, it uses a stack-based 8-deep set of 80-bit FP registers. Each FP register has 1 sign bit, 15-bit exponent and 64-bit mantissa (no implicit bit).
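
To make the register layout concrete, here is a minimal Python sketch that decodes the 10-byte in-memory form of such a value. The function name and the two sample constants are made up for illustration, and the sketch only handles zero and normal numbers (no denormals, infinities or NaNs).

```python
import math

def decode_x87_extended(raw: bytes) -> float:
    """Decode a 10-byte little-endian 80-bit value (zero and normal numbers only)."""
    if len(raw) != 10:
        raise ValueError("expected exactly 10 bytes")
    mantissa = int.from_bytes(raw[:8], "little")   # 64 bits, integer bit is explicit
    sign_exp = int.from_bytes(raw[8:], "little")   # 1 sign bit + 15 exponent bits
    sign = -1.0 if sign_exp & 0x8000 else 1.0
    exponent = sign_exp & 0x7FFF
    if mantissa == 0:
        return sign * 0.0
    # The mantissa is a 1.63 fixed-point number and the exponent bias is 16383.
    return sign * math.ldexp(mantissa, exponent - 16383 - 63)

# Hypothetical sample encodings, written out by hand.
ONE = bytes.fromhex("0000000000000080ff3f")         # 1.0
MINUS_2_5 = bytes.fromhex("00000000000000a000c0")   # -2.5
```

Because the integer bit is stored explicitly, 1.0 has the mantissa 8000000000000000h rather than zero, unlike the 32-bit and 64-bit formats.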

Most 8087 instructions operate on ToS (Top-of-Stack). Programmers were used to explicit register operands and were unfamiliar with stack-based operations. It was a struggle to write efficient code.

The biggest issue with the 8087 is its buggy stack architecture. Due to misalignment between the design and hardware teams, the hardware does not automatically spill an overflowed stack to a memory-based virtual stack. Overflow is handled as an exception, which is complex and slow. Software works around it by not overflowing the stack in the first place.

Because of this, results can be inconsistent depending on whether the calculations are done entirely in 80-bit registers or spilled midway into 64-bit FP in memory (with less precision), which in turn depends on the compiler and optimization level.
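
A small Python sketch of the effect, using exact fractions to imitate rounding to a 64-bit mantissa (register) versus a 53-bit mantissa (64-bit double in memory). This is only a model: the helper names are made up, exponent range is ignored, and real compilers spill at points you cannot easily predict.

```python
from fractions import Fraction

def round_sig(x: Fraction, bits: int) -> Fraction:
    """Round x to `bits` significant binary digits, ties to even (exponent range ignored)."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    scaled = abs(x)
    shift = 0
    # Scale by powers of two until 2**(bits-1) <= scaled < 2**bits.
    while scaled >= 2 ** bits:
        scaled /= 2
        shift -= 1
    while scaled < 2 ** (bits - 1):
        scaled *= 2
        shift += 1
    return sign * Fraction(round(scaled)) / Fraction(2) ** shift

def sub_after_add(a: Fraction, b: Fraction, intermediate_bits: int) -> Fraction:
    """Compute (a + b) - a, rounding the intermediate sum to `intermediate_bits`."""
    t = round_sig(a + b, intermediate_bits)
    return round_sig(t - a, 53)   # the final result is stored as a 64-bit double

a, b = Fraction(1), Fraction(1, 2 ** 60)
kept_in_register = sub_after_add(a, b, 64)    # intermediate stays in an 80-bit register
spilled_to_memory = sub_after_add(a, b, 53)   # intermediate spilled as a 64-bit double
```

Here kept_in_register is 2^-60 while spilled_to_memory is 0: same source expression, different answer.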

At one point, it was thought the reliance on ST(0) made it impossible to pipeline FP operations — because they all used ST(0). But Pentium proved it was possible to do register renaming with FXCH and achieved pipelined FP operations. It was a breakthrough. From that point, x86 FP became competitive in speed with RISC CPUs.

The second issue is that explicit synchronization is needed using FWAIT. This is needed 99% of the time, so assemblers insert FWAIT automatically before each FP instruction. This is not needed from the 80287 onwards, as the CPU waits for the FPU automatically.

8087 runs in parallel to 8086. It is slow compared to integer operations, e.g. FADD takes 70 – 100 cycles, so it is possible to run many x86 instructions before executing the next FP operation. Question is, how many programs made use of this?

The third is emulation. Unlike the 80286, the 8086 does not raise a Coprocessor Absent exception that would have allowed transparent software emulation. This means the executable cannot contain FP instructions directly; it must use emulator-transformed code that calls emulation functions if the 8087 is absent, or is patched back to 8087 code if present. This is an excellent use of self-modifying code.

The technique is pretty clever. The compiler emits actual 8087 code and marks each instruction as requiring a fixup (relocation); the linker transforms them into emulator calls using the fixup (which is an addition) if emulator support is needed.

(This technique does require FWAIT before each FP instruction to patch properly.)
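
A rough Python sketch of the "fixup is an addition" idea. The specifics are assumptions, not taken from the notes above: I am assuming the commonly described DOS-era scheme where FWAIT (9Bh) plus an ESC opcode (D8h to DFh) becomes INT 34h to 3Bh (CDh followed by the interrupt number), which works out to adding 5C32h to the 16-bit word. The function and constant names are made up.

```python
FWAIT_TO_INT_FIXUP = 0x5C32   # assumed constant: turns 9B D8..DF into CD 34..3B

def apply_fixup(code: bytearray, offset: int, fixup: int = FWAIT_TO_INT_FIXUP) -> None:
    """Add a 16-bit fixup to the little-endian word at `offset`, wrapping at 16 bits."""
    word = int.from_bytes(code[offset:offset + 2], "little")
    code[offset:offset + 2] = ((word + fixup) & 0xFFFF).to_bytes(2, "little")

native = bytearray([0x9B, 0xD8, 0xC1])   # FWAIT ; FADD ST, ST(1)
emulated = bytearray(native)
apply_fixup(emulated, 0)                 # now CD 34 C1: INT 34h, then the ModRM byte
```

The FWAIT byte is what gives the fixup a full 16-bit word to rewrite, which is why it has to be present before each FP instruction.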

Side note. 8087 also supports 64-bit integer and 18-digit BCD operations. These are obsolete today.

SafariUni 5x25 bino


Safari Uni (观穹) 5x25 bino

Despite a different housing, it is unmistakably the same bino as VisionKing. They even have the same pouch! VK has "VisionKing" custom label on the pouch. This does not have custom label.

(VK has custom labelled shipping box and packing tape. Now, that's professional. I didn't expect this for a low-end bino.)

Brightness is fine. No noticeable dimness in day time. Same brightness as VK.

Zone of focus in the middle only, as expected. It is sharper than VK and has crispy snap-to-focus.

All lens surfaces seem coated, unsure if prism is coated — can see one white and some cream/yellow (single coated) reflection like VK. No phase coating.

Bino focusing knob claims 15° and FMC.

Twist-up eyecups have click-stops.

Measured weight is 577 g w/ front caps and w/o strap. This is heavier than all my other binos!

Claims min focus distance is 1.9 m, actual is ~3 m. This is worse than VK!

I wonder if there are two variants depending on the batch. The min focus distance is either 1.9 m or 3 m. Safari Uni got the 1.9 m one in their first batch, so they advertised as such. VK was advertised with 3 m, but mine was ~2 m.

But it's fine, no one uses this bino for macro viewing. :lol: (Note that this is a valid use-case for binos.)

VisionKing 5x25 bino


VisionKing (视界王) 5x25 bino

Brightness is fine. No noticeable dimness in day time. Looking at the sky through the objective lens at an arm's length, it is as bright as Svbony SV202. At night, seems brighter than SV202! This is possible, 5 mm exit pupil (25 / 5) vs 4 mm (32 / 8).

Zone of focus in the middle only, as expected. It is not super sharp, but is acceptable. A bit hard to achieve optimal focus due to 'mushy' focus — the focus zone is blurry.

All lens surfaces seem coated, even the objective glass. Prism is said to be uncoated. Torchlight test shows white (no coating) and cream/yellow (single coating) reflections among purple and green (multi-coating). No phase coating.

Bino focusing knob claims 15.8° and does not claim MC now. Box says FC.

Twist-up eyecups are friction-based, no clicks and a bit loose.

Measured weight is 540 g w/ front caps and w/o strap. This is heavier than most of my binos!

Claims min focus distance is 3 m, actual is ~2.1 m!

Greenfield 8086: accessing flat memory

Short of making 8086 into a 24-bit CPU, we need to add instructions to manipulate 24-bit addresses. Taking cue from pointer arithmetic, not many are needed:

  • MOV.A, ADD.A, SUB.A, CMP.A
  • ADC.A, SBB.A
  • AND.A, OR.A
  • LEA
  • PUSH.A, POP.A
  • MOV.H

MOV, ADD, SUB and CMP are basic operations.

ADC and SBB can be used to construct >24-bit pointers, but I don't know if they will ever be used.

Bitwise AND and OR are useful.

LEA is of course essential. There is no 16-bit variant, only 24-bit.

PUSH.A pushes 24-bit value onto the stack (as two 16-bit values) and POP.A pops two 16-bit values.

MOV.H moves a value into the upper bits of another register and vice versa. Technically, with MOV.H, we do not need any 24-bit specific instructions: everything can be done with standard 16-bit instructions, but the operations will be lengthier.

Adding to a pointer:

; w/ 24-bit instructions
add.a si, 4

; w/ 16-bit instructions
mov.h ax, si
add si, 4
adc ax, 0
mov.h si, ax

Calculating diff:

; Calculate si - di

; w/ 24-bit instructions
mov.a ax, si
sub.a ax, di   ; ax (24-bit) contains the diff

; w/ 16-bit instructions
mov ax, si
mov.h dx, si
mov.h cx, di
sub ax, di
sbb dx, cx     ; dx:ax contains the diff
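
To check that the 16-bit sequences really carry and borrow into the upper bits, here is a Python model of both listings. It is only a sketch: the function names are made up, the immediate is assumed to fit in 16 bits, and the upper part of a register is assumed to be 8 bits wide.

```python
MASK16 = 0xFFFF

def add24_via_16(ptr: int, imm: int) -> int:
    """Model of the 16-bit 'adding to a pointer' sequence."""
    hi = ptr >> 16                      # mov.h ax, si
    total = (ptr & MASK16) + imm        # add si, 4
    lo, carry = total & MASK16, total >> 16
    hi = (hi + carry) & 0xFF            # adc ax, 0 (upper part assumed 8 bits wide)
    return (hi << 16) | lo              # mov.h si, ax

def sub24_via_16(si: int, di: int):
    """Model of the 16-bit 'calculating diff' sequence; returns (dx, ax)."""
    ax = si & MASK16                    # mov ax, si
    dx = si >> 16                       # mov.h dx, si
    cx = di >> 16                       # mov.h cx, di
    diff = ax - (di & MASK16)           # sub ax, di
    borrow = 1 if diff < 0 else 0
    ax = diff & MASK16
    dx = (dx - cx - borrow) & MASK16    # sbb dx, cx
    return dx, ax
```

For example, add24_via_16(0x00FFFE, 4) crosses the 64 kB boundary and gives 0x010002, and sub24_via_16(0x010002, 0x00FFFE) gives (0, 4).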

Stack operations

Remember I said SP should not be a GPR? It means we need SP specific instructions as well. Only a few are needed:

  • MOV.A SP, r/m | immed
  • MOV.A r/m, SP
  • ADD.A SP, r/m | +/-immed
  • SUB.A SP, r/m

SP can be used directly to reference stack variables:

push arg
call proc
...

proc:
sub.a sp, N    ; reserve space for local vars

mov ax, [sp + N + 4]   ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

add.a sp, N    ; epilogue
ret 2

Or via a frame pointer:

push arg
call proc
...

proc:
push.a fx
mov.a fx, sp
sub.a sp, N    ; reserve space for local vars

mov ax, [fx + 8]       ; arg
mov bx, [fx - N]       ; first local var
mov cx, [fx - 2]       ; last local var
...

mov.a sp, fx   ; epilogue
pop.a fx
ret 2
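
The offsets in the two listings can be checked by replaying the prologue in Python. This is a sketch under the assumptions already used above: arg is one 16-bit word, and return addresses and PUSH.A values take 4 bytes. The function name is made up.

```python
WORD, FAR_PTR = 2, 4   # sizes in bytes; FAR_PTR is a 24-bit value stored as two words

def frame_addresses(sp0: int, n: int, use_frame_pointer: bool) -> dict:
    """Replay the prologue and return the addresses of arg, first and last local."""
    sp = sp0
    sp -= WORD                # push arg
    arg = sp
    sp -= FAR_PTR             # call proc (pushes the return address)
    fx = None
    if use_frame_pointer:
        sp -= FAR_PTR         # push.a fx
        fx = sp               # mov.a fx, sp
    sp -= n                   # sub.a sp, N
    return {"sp": sp, "fx": fx, "arg": arg, "first": sp, "last": sp + n - WORD}

direct = frame_addresses(0x1000, 6, use_frame_pointer=False)
framed = frame_addresses(0x1000, 6, use_frame_pointer=True)
```

With N = 6, direct["arg"] equals direct["sp"] + N + 4 and framed["arg"] equals framed["fx"] + 8, matching the listings.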

Any register that allows register indirect addressing can be used — memory is flat!

If we use a dedicated frame pointer, we end up with 7 GPRs. If we don't, getting a stack trace and unwinding the stack is just as difficult as having no frame pointer at all, cos we don't know which register is the frame pointer!

Unwinding the stack without frame pointer is a big issue on modern CPUs. Metadata is needed.

Using Return Stack

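It is not common to have two explicit stacks, one solely for return addresses and the other for data, though modern CPUs offer a shadow stack to defend against Return-Oriented Programming (ROP). (The Return-Address Stack (RAS) in branch predictors also holds return addresses, but it is there for speed, not protection.)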

I don't see why not; it seems trivial to support. Stack buffer overflow, whether unintentional or malicious, is a never-ending source of bugs. We need a separate SP (let's call it RSP), a couple of instructions, and a change to CALL/RET to use it.

New instructions needed:

  • MOV.A RSP, r/m
  • MOV.A r/m, RSP

These are privileged instructions. The Return Stack should be on special protected pages if MMU is present. There are no instructions to PUSH/POP nor manipulate the Return Stack. It is purely for CALL and RET.

To help to unwind the stack, we push SP onto the Return Stack too, so each CALL uses 8 bytes (4 for return address and 4 for SP).

CALL and RET must be paired. There is no need to restore SP because it is done automatically.

push arg
call proc
...

proc:
sub.a sp, N    ; reserve space for local vars

mov ax, [sp + N]       ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

ret 2          ; no epilogue needed, will restore SP

One downside is that we can no longer RET to an arbitrary address — but this is the whole point!

push.a ax      ; no longer allowed
ret
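
A toy Python model of the CALL/RET behaviour described in this section. It is a sketch, not a spec: the class name is made up, and I am assuming the SP recorded on the Return Stack is the value at the moment of CALL (after the arguments are pushed), so that RET n restores it and then pops n bytes of arguments.

```python
class ReturnStackMachine:
    """Toy model: CALL/RET use a separate return stack that also records SP."""

    def __init__(self, sp: int):
        self.sp = sp
        self.pc = 0
        self.return_stack = []   # entries are (return address, SP at the time of CALL)

    def push(self, size: int = 2) -> None:
        self.sp -= size

    def call(self, target: int, return_addr: int) -> None:
        self.return_stack.append((return_addr, self.sp))
        self.pc = target

    def ret(self, arg_bytes: int = 0) -> None:
        return_addr, saved_sp = self.return_stack.pop()
        self.sp = saved_sp + arg_bytes   # locals vanish, then the args are popped
        self.pc = return_addr

m = ReturnStackMachine(sp=0x1000)
m.push()                # push arg
m.call(0x200, 0x104)    # call proc
m.sp -= 10              # sub.a sp, N
m.ret(2)                # ret 2, no epilogue needed
```

After the final ret, m.sp is back at 0x1000 and m.pc is 0x104. Nothing the callee writes on the data stack can change where RET goes.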

Efficiency

Pushing four 16-bit words on each CALL on a 16-bit CPU is inefficient. Even functions that do not use the stack pay this penalty.

Since the address space is 24-bit, we use the upper 8 bits of the return address to store the stack size in words. With 255 reserved as an escape value, this allows up to 508 bytes (254 words) of local variables. If more is needed, 255 means SP is saved separately: we push two additional words whose lower 24 bits hold SP.

Thus, we push only two 16-bit words on the Return Stack for each CALL in most cases. If the function uses ENTER, it modifies the top 8 bits of the return address or pushes two additional 16-bit words.
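
One way the packed entry could look, sketched in Python. Everything here is an assumption layered on the description above: 255 is the escape value (so 0 to 254 words fit inline), the extra entry holding SP sits below the return address entry, and the function names are made up.

```python
ESCAPE = 0xFF          # assumed escape value: the frame is too big to encode inline
ADDR_MASK = 0xFFFFFF

def make_entries(return_addr: int, frame_words: int, sp: int) -> list:
    """Return the 32-bit entries left on the Return Stack after CALL + ENTER."""
    if frame_words < ESCAPE:
        return [(frame_words << 24) | (return_addr & ADDR_MASK)]
    # Too big: save SP (as it was before ENTER) in an extra entry.
    return [sp & ADDR_MASK, (ESCAPE << 24) | (return_addr & ADDR_MASK)]

def restore(entries: list, sp_now: int) -> tuple:
    """Undo make_entries: return (return address, SP as it was before ENTER)."""
    top = entries[-1]
    size, return_addr = top >> 24, top & ADDR_MASK
    if size == ESCAPE:
        return return_addr, entries[-2] & ADDR_MASK
    return return_addr, sp_now + 2 * size
```

In the common case RET only has to add 2 * size to the current SP, which is exactly why an extra PUSH inside the function breaks the scheme.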

Revamped code:

push arg
call proc

proc:
enter N        ; reserve space for local vars, updates Return Stack

mov ax, [sp + N]       ; arg
mov bx, [sp + N - N]   ; first local var
mov cx, [sp + N - 2]   ; last local var
...

ret 2          ; no epilogue needed, will restore SP

Unfortunately, this scheme does not work if we PUSH onto the stack. It is possible to make it work. This is left as an exercise for the reader. :-P

In the future, for 32-bit CPU, we will always push two 32-bit words (Return Address and SP) on the Return Stack.

Greenfield 8086: flat memory model

The first thing I'm going to get rid of is the 8086's segmented memory model — its defining characteristic!

The segmented memory model worked well in the 60s and 70s, simplifying code/data relocation and providing a cheap way to offer protection in a multi-process environment.

It works well as long as your data fits within a segment. Once that is exceeded, it is painful.

With hindsight, we can see segmentation falling out of favour with paging being the choice of memory management.

8086 also has a bigger address space (20-bit) than its register size (16-bit), so it is difficult to address the entire space.
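
For reference, this is how the 8086 forms a 20-bit address from two 16-bit values, as a one-line Python sketch (the function name is made up):

```python
def physical_address(segment: int, offset: int) -> int:
    """8086 real-mode translation: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF
```

Many segment:offset pairs alias the same byte, e.g. 1234:0010 and 1235:0000 both give 12350h, which is part of what makes pointer comparison and arithmetic awkward.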

80286 — nearly flat memory

80286 uses segment selectors instead of physical segments in Protected Mode. The segment registers index into a Descriptor Table that contains the base of the segment, among others.

If the upper bits of a logical address went directly into the lower bits of the selector, we could access >64 kB almost seamlessly.

But Intel put 3 control bits at the bottom, so complicated pointer arithmetic was needed again (we need to add 8 to move to the next selector).

; ideal sel:ofs
0000:ffff + 1 -> 0001:0000

; 286
0000:ffff + 1 -> 0008:0000
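
The same two cases as a Python sketch, with the selector step as a parameter (1 for the ideal layout, 8 for the 286). The function name is made up.

```python
def increment(selector: int, offset: int, selector_step: int) -> tuple:
    """Advance a sel:ofs pointer by one byte, moving to the next selector on wrap."""
    offset += 1
    if offset > 0xFFFF:
        offset = 0
        selector = (selector + selector_step) & 0xFFFF
    return selector, offset

ideal = increment(0x0000, 0xFFFF, selector_step=1)   # (0x0001, 0x0000)
i286 = increment(0x0000, 0xFFFF, selector_step=8)    # (0x0008, 0x0000)
```

With a step of 1 the wrap is just a carry from offset into selector; with a step of 8 it needs an explicit test and add on every pointer increment that might cross a segment.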

It is not 100% seamless — a single element cannot span segments, so >64 kB data has to be accessed carefully. Maybe this was why Intel purposely made selectors non-contiguous — you needed special handling code anyway.

Anyway, this is water under the bridge.

Other ways

The alternative to segmentation is flat memory. We need to either widen register size or use paired-registers.

Instead of 16-bit registers, we will have 20-bit registers. This allows us to put a full address in a register and dereference it directly. But this raises a question. How do we manipulate these 20-bit registers? Does it mean it is a 20-bit CPU now?

The other approach is to use paired registers. This is a common approach. 8-bit CPUs pair two 8-bit registers to access 16-bit memory space. But the 8086 has only 8 registers. Pairing them means we only have 4 pairs. Typically we need 2 – 3 pointers at the same time, so we are left with only 2 – 4 registers.

I'll go for the widened register approach. Instead of 20-bit, let's go for 24-bit — giving 16 MB address space. All registers are 24-bit. The CPU remains 16-bit. Most instructions manipulate 16-bit data, but some manipulate 24-bit — for pointer arithmetic.

With linear addressing, there are no more memory models. All pointers are FAR, indirect jumps/calls are FAR and all function returns are FAR. We are free from the 64 kB barrier.

The stack is also free of its 64 kB limit (though stacks seldom grow this big), but more importantly, any register can now reference the stack directly.

This does increase pointer size from 2 to 4 bytes. This makes the ISA unsuitable for systems with 64 kB or less since they only need 2-byte offsets. Once we get above a threshold, say 128 kB, the overhead no longer matters.

Another con is that we need a big relocation table. In the absence of paging and running each process in its own isolated memory space, all global code/data references need to be relocated.
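
What the loader has to do with that relocation table, as a minimal Python sketch. The format is an assumption for illustration: the table is a list of offsets into the image, each pointing at a 4-byte little-endian pointer slot holding a 24-bit address linked at base 0. The function name is made up.

```python
def relocate(image: bytearray, reloc_offsets: list, load_base: int) -> None:
    """Add load_base to every 4-byte pointer slot listed in reloc_offsets."""
    for off in reloc_offsets:
        ptr = int.from_bytes(image[off:off + 4], "little")
        ptr = (ptr + load_base) & 0xFFFFFF     # addresses are 24-bit
        image[off:off + 4] = ptr.to_bytes(4, "little")

image = bytearray(8)
image[4:8] = (0x000120).to_bytes(4, "little")  # one pointer, linked at base 0
relocate(image, [4], load_base=0x020000)
```

Every global reference costs one table entry and one patch at load time, which is where the size of the table comes from.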

A possibility is to use Global Offset Table (GOT). This makes the code PIC (Position-Independent Code) and make it reusable in multi-process environments.