
                       GEKKO PAIRED-SINGLES - WHAT ARE THESE?

    ---------------------------------------------------------------------------

    1. Overview

    Paired-Singles are the Gekko analog of the "streaming" SIMD instructions
    of Intel (and other x86) processors, known as SSE. This extension is
    specific to the Gekko processor and is used to operate on two
    single-precision numbers ("floats" in C) with a single instruction.

    In paired-single mode each Floating-Point Register (FPR) of Gekko is
    split in two halves : one half holds the first single-precision number,
    the other half holds the second. This picture shows the FPR format in
    paired-single mode :

           ---------------------------
    bits: |0..........31|32.........63|
          |-------------+-------------|
          |    PS0      |     PS1     |
           ---------------------------

    These parts are named "PS0" and "PS1". Since Gekko uses big-endian bit
    numbering, bits are numbered from left to right (bit 0 is the most
    significant). There are 32 PS0 and 32 PS1 registers in total.

    The PS instruction set is divided into two parts : Load and Store
    Quantization instructions and Paired-Single Arithmetic instructions.
    Load and Store Quantization instructions are used for fast conversion
    between integers and floats and for some specific memory operations on
    the PS0 and PS1 parts of an FPR. Details are given later in this
    document.

    Paired-single mode is useful for fast vector and matrix calculations.

    2. How to enable Paired-Single Mode

    To enable PS mode, set the following bits of Gekko's custom HID2
    Special-Purpose Register (HID2 is SPR 920).

           --------------
    bits: | 0  |1| 2 |
          |----+-+---+--- ... (don't care)
          |LSQE| |PSE|
           --------------

    LSQE    - Paired-Single load and store instructions enabled
    PSE     - Paired-Single mode enabled

    The following low-level assembly code demonstrates how to enable PS :

        mfspr   r0, 920             # read HID2
        oris    r0, r0, 0xA000      # set LSQE and PSE bits
        mtspr   920, r0             # write back
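
    The 0xA000 constant follows from the big-endian bit numbering used in the
    HID2 picture. A minimal C sketch of that arithmetic is given here; the
    helper names (ppc_bit, hid2_ps_enable_mask, hid2_ps_enable_oris_imm) are
    made up for this example and are not part of any SDK :

```c
#include <stdint.h>

/* Convert a PowerPC bit number (0 = most significant bit)
   into a mask for a 32-bit register. */
static uint32_t ppc_bit(unsigned n)
{
    return UINT32_C(0x80000000) >> n;
}

/* Bit positions taken from the HID2 picture. */
enum { HID2_LSQE_BIT = 0, HID2_PSE_BIT = 2 };

/* Full 32-bit mask with LSQE and PSE set. */
static uint32_t hid2_ps_enable_mask(void)
{
    return ppc_bit(HID2_LSQE_BIT) | ppc_bit(HID2_PSE_BIT);
}

/* Immediate operand for "oris" : the upper 16 bits of the mask. */
static uint16_t hid2_ps_enable_oris_imm(void)
{
    return (uint16_t)(hid2_ps_enable_mask() >> 16);
}
```

    Both bits live in the upper halfword of HID2, which is why a single
    "oris" is enough and no "ori" is needed.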

    If you try to execute any PS instruction without the LSQE and PSE bits
    set, an illegal instruction exception will be generated.

    3. Paired-Single Load and Store

    4. Paired-Single Arithmetic

    Sorted opcode list :

      ------------------------------------
     |00100|  D  |00000|  B  |   264    |R|     ps_abs
     |00100|  D  |  A  |  B  |    21    |R|     ps_add
     |00100| D 00|  A  |  B  |    32    |0|     ps_cmpo0
     |00100| D 00|  A  |  B  |    96    |0|     ps_cmpo1
     |00100| D 00|  A  |  B  |     0    |0|     ps_cmpu0
     |00100| D 00|  A  |  B  |    64    |0|     ps_cmpu1
     |00100|  D  |  A  |  B  |    18    |R|     ps_div
     |00100|  D  |  A  |  B  |   528    |R|     ps_merge00
     |00100|  D  |  A  |  B  |   560    |R|     ps_merge01
     |00100|  D  |  A  |  B  |   592    |R|     ps_merge10
     |00100|  D  |  A  |  B  |   624    |R|     ps_merge11
     |00100|  D  |00000|  B  |    72    |R|     ps_mr
     |00100|  D  |00000|  B  |   136    |R|     ps_nabs
     |00100|  D  |00000|  B  |    40    |R|     ps_neg
     |00100|  D  |00000|  B  |    24    |R|     ps_res
     |00100|  D  |00000|  B  |    26    |R|     ps_rsqrte
     |00100|  D  |  A  |  B  |    20    |R|     ps_sub
     |-----+-----+-----+-----+----------+-|
     |00100|  D  |  A  |  B  |  C  | 29 |R|     ps_madd
     |00100|  D  |  A  |  B  |  C  | 14 |R|     ps_madds0
     |00100|  D  |  A  |  B  |  C  | 15 |R|     ps_madds1
     |00100|  D  |  A  |  B  |  C  | 28 |R|     ps_msub
     |00100|  D  |  A  |00000|  C  | 25 |R|     ps_mul
     |00100|  D  |  A  |00000|  C  | 12 |R|     ps_muls0
     |00100|  D  |  A  |00000|  C  | 13 |R|     ps_muls1
     |00100|  D  |  A  |  B  |  C  | 31 |R|     ps_nmadd
     |00100|  D  |  A  |  B  |  C  | 30 |R|     ps_nmsub
     |00100|  D  |  A  |  B  |  C  | 23 |R|     ps_sel
     |00100|  D  |  A  |  B  |  C  | 10 |R|     ps_sum0
     |00100|  D  |  A  |  B  |  C  | 11 |R|     ps_sum1
      ------------------------------------

    Note : the R opcode field (record bit, which copies the floating-point
    exception summary bits of FPSCR into CR field 1) is implemented, but
    unused by regular GC programs.
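
    To show how the table rows map to 32-bit instruction words, here is a
    small C sketch of an encoder. It assumes the usual PowerPC field
    placement (bit 0 is the most significant bit) : primary opcode 4 in bits
    0..5, D in 6..10, A in 11..15, B in 16..20, then either a 10-bit opcode
    in bits 21..30 or C in 21..25 followed by a 5-bit opcode in 26..30, and R
    in bit 31. The function names are hypothetical, and the compare
    instructions (whose D field is only 3 bits wide) are not handled :

```c
#include <stdint.h>

enum { PS_PRIMARY = 4 };   /* the leftmost column of the table */

/* First group of the table : 10-bit opcode after the B field. */
static uint32_t ps_encode_x(unsigned d, unsigned a, unsigned b,
                            unsigned xo10, unsigned r)
{
    return ((uint32_t)PS_PRIMARY << 26)
         | ((uint32_t)(d & 31) << 21)
         | ((uint32_t)(a & 31) << 16)
         | ((uint32_t)(b & 31) << 11)
         | ((uint32_t)(xo10 & 1023) << 1)
         | (uint32_t)(r & 1);
}

/* Second group : C field followed by a 5-bit opcode. */
static uint32_t ps_encode_a(unsigned d, unsigned a, unsigned b, unsigned c,
                            unsigned xo5, unsigned r)
{
    return ((uint32_t)PS_PRIMARY << 26)
         | ((uint32_t)(d & 31) << 21)
         | ((uint32_t)(a & 31) << 16)
         | ((uint32_t)(b & 31) << 11)
         | ((uint32_t)(c & 31) << 6)
         | ((uint32_t)(xo5 & 31) << 1)
         | (uint32_t)(r & 1);
}
```

    For example, ps_encode_x(1, 2, 3, 21, 0) builds ps_add with D=1, A=2, B=3
    and yields 0x1022182A.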

    Descriptions :

    PS_ABS      - absolute value

        Clear bit 0 of PS0[B] and copy result to PS0[D]
        Clear bit 0 of PS1[B] and copy result to PS1[D]

    PS_ADD      - add

        PS0[D] = PS0[A] + PS0[B]
        PS1[D] = PS1[A] + PS1[B]

    PS_CMPO0    - compare ordered high

        "c" holds result of comparison
        If (PS0[A] is NaN or PS0[B] is NaN) then c = 0001b
        Else if (PS0[A] < PS0[B]) then c = 1000b
        Else if (PS0[A] > PS0[B]) then c = 0100b
        Else c = 0010b
        Save result in D field of condition register (CR[D] = c).

    PS_CMPO1    - compare ordered low

        "c" holds result of comparison
        If (PS1[A] is NaN or PS1[B] is NaN) then c = 0001b
        Else if (PS1[A] < PS1[B]) then c = 1000b
        Else if (PS1[A] > PS1[B]) then c = 0100b
        Else c = 0010b
        Save result in D field of condition register (CR[D] = c).

    PS_CMPU0    - compare unordered high

        "c" holds result of comparison
        If (PS0[A] is NaN or PS0[B] is NaN) then c = 0001b
        Else if (PS0[A] < PS0[B]) then c = 1000b
        Else if (PS0[A] > PS0[B]) then c = 0100b
        Else c = 0010b
        Save result in D field of condition register (CR[D] = c).

    PS_CMPU1    - compare unordered low

        "c" holds result of comparison
        If (PS1[A] is NaN or PS1[B] is NaN) then c = 0001b
        Else if (PS1[A] < PS1[B]) then c = 1000b
        Else if (PS1[A] > PS1[B]) then c = 0100b
        Else c = 0010b
        Save result in D field of condition register (CR[D] = c).

    These four compare instructions look the same here, because the FPSCR
    updates, which are not needed for this overview, are omitted.
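
    The comparison rule shared by all four instructions can be sketched in C
    as follows. Like the descriptions, it ignores FPSCR; the constant names
    and the function name ps_compare are chosen for this example only :

```c
#include <math.h>

/* Values of "c" as listed in the descriptions (binary 1000, 0100, 0010, 0001). */
enum { C_LESS = 0x8, C_GREATER = 0x4, C_EQUAL = 0x2, C_NAN = 0x1 };

/* Compare one half (PS0 or PS1) of A against the same half of B. */
static unsigned ps_compare(float a, float b)
{
    if (isnan(a) || isnan(b)) return C_NAN;
    if (a < b) return C_LESS;
    if (a > b) return C_GREATER;
    return C_EQUAL;
}
```

    An emulator would call this with the PS0 halves for ps_cmpo0 / ps_cmpu0
    and with the PS1 halves for ps_cmpo1 / ps_cmpu1, then store the result in
    CR[D].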

    PS_DIV      - divide

        PS0[D] = PS0[A] / PS0[B]
        PS1[D] = PS1[A] / PS1[B]

    PS_MERGE00  - merge high

        PS0[D] = PS0[A]
        PS1[D] = PS0[B]

    PS_MERGE01  - merge direct

        PS0[D] = PS0[A]
        PS1[D] = PS1[B]

    PS_MERGE10  - merge swapped

        PS0[D] = PS1[A]
        PS1[D] = PS0[B]

    PS_MERGE11  - merge low

        PS0[D] = PS1[A]
        PS1[D] = PS1[B]
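
    The four merge variants can be modelled in C as shown here. This is a
    sketch only : the ps_reg type and the ps_make helper are hypothetical and
    simply represent one FPR as a pair of floats :

```c
/* Hypothetical model of one FPR in paired-single mode. */
typedef struct { float ps0, ps1; } ps_reg;

static ps_reg ps_make(float ps0, float ps1)
{
    ps_reg r = { ps0, ps1 };
    return r;
}

static ps_reg ps_merge00(ps_reg a, ps_reg b) { return ps_make(a.ps0, b.ps0); }
static ps_reg ps_merge01(ps_reg a, ps_reg b) { return ps_make(a.ps0, b.ps1); }
static ps_reg ps_merge10(ps_reg a, ps_reg b) { return ps_make(a.ps1, b.ps0); }
static ps_reg ps_merge11(ps_reg a, ps_reg b) { return ps_make(a.ps1, b.ps1); }
```

    Note that passing the same register as both A and B to ps_merge10 swaps
    its two halves.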

    PS_MR       - move register

        PS0[D] = PS0[B]
        PS1[D] = PS1[B]

    PS_NABS     - negative absolute value

        Set bit 0 of PS0[B] and copy result to PS0[D]
        Set bit 0 of PS1[B] and copy result to PS1[D]

    PS_NEG      - negate

        Invert bit 0 of PS0[B] and copy result to PS0[D]
        Invert bit 0 of PS1[B] and copy result to PS1[D]
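
    "Bit 0" in PS_ABS, PS_NABS and PS_NEG is the sign bit of each single
    (the most significant bit, 0x80000000). A C sketch of the per-half
    operation, assuming the host "float" is a 32-bit IEEE 754 single; the
    helper names are invented for this example :

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT UINT32_C(0x80000000)   /* bit 0 in big-endian numbering */

/* Reinterpret a float as its raw 32-bit pattern and back. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* One half of ps_abs, ps_nabs and ps_neg. */
static float half_abs(float b)  { return bits_float(float_bits(b) & ~SIGN_BIT); }
static float half_nabs(float b) { return bits_float(float_bits(b) |  SIGN_BIT); }
static float half_neg(float b)  { return bits_float(float_bits(b) ^  SIGN_BIT); }
```

    Each PS instruction applies the same operation to both PS0[B] and PS1[B].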

    PS_RES      - reciprocal estimate

        PS0[D] = 1 / PS0[B]
        PS1[D] = 1 / PS1[B]

    PS_RSQRTE   - reciprocal square root estimate

        PS0[D] = 1 / SQRT(PS0[B])
        PS1[D] = 1 / SQRT(PS1[B])

    PS_SUB      - subtract

        PS0[D] = PS0[A] - PS0[B]
        PS1[D] = PS1[A] - PS1[B]

    PS_MADD     - multiply-add
        
        PS0[D] = PS0[A] * PS0[C] + PS0[B]
        PS1[D] = PS1[A] * PS1[C] + PS1[B]

    PS_MADDS0   - multiply-add scalar high

        PS0[D] = PS0[A] * PS0[C] + PS0[B]
        PS1[D] = PS1[A] * PS0[C] + PS1[B]

    PS_MADDS1   - multiply-add scalar low

        PS0[D] = PS0[A] * PS1[C] + PS0[B]
        PS1[D] = PS1[A] * PS1[C] + PS1[B]

    PS_MSUB     - multiply-subtract
        
        PS0[D] = PS0[A] * PS0[C] - PS0[B]
        PS1[D] = PS1[A] * PS1[C] - PS1[B]

    PS_MUL      - multiply

        PS0[D] = PS0[A] * PS0[C]
        PS1[D] = PS1[A] * PS1[C]

    PS_MULS0    - multiply scalar high

        PS0[D] = PS0[A] * PS0[C]
        PS1[D] = PS1[A] * PS0[C]

    PS_MULS1    - multiply scalar low

        PS0[D] = PS0[A] * PS1[C]
        PS1[D] = PS1[A] * PS1[C]

    PS_NMADD    - negative multiply-add

        PS0[D] = - (PS0[A] * PS0[C] + PS0[B])
        PS1[D] = - (PS1[A] * PS1[C] + PS1[B])

    PS_NMSUB    - negative multiply-subtract

        PS0[D] = - (PS0[A] * PS0[C] - PS0[B])
        PS1[D] = - (PS1[A] * PS1[C] - PS1[B])
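
    As an illustration of why the "scalar" variants exist, here is a C sketch
    that multiplies a 2x2 matrix (kept as two column registers) by a
    2-component vector using PS_MULS0 followed by PS_MADDS1. The ps_reg type,
    ps_make and mat2_mul_vec2 are hypothetical, the argument order (a, c, b)
    is a choice made for this sketch, and host rounding may differ from the
    real hardware :

```c
/* Hypothetical model of one FPR in paired-single mode. */
typedef struct { float ps0, ps1; } ps_reg;

static ps_reg ps_make(float ps0, float ps1)
{
    ps_reg r = { ps0, ps1 };
    return r;
}

/* ps_muls0 : both halves of A are multiplied by PS0 of C. */
static ps_reg ps_muls0(ps_reg a, ps_reg c)
{
    return ps_make(a.ps0 * c.ps0, a.ps1 * c.ps0);
}

/* ps_madds1 : both halves of A are multiplied by PS1 of C, then B is added. */
static ps_reg ps_madds1(ps_reg a, ps_reg c, ps_reg b)
{
    return ps_make(a.ps0 * c.ps1 + b.ps0, a.ps1 * c.ps1 + b.ps1);
}

/* result = col0 * v.ps0 + col1 * v.ps1 */
static ps_reg mat2_mul_vec2(ps_reg col0, ps_reg col1, ps_reg v)
{
    ps_reg t = ps_muls0(col0, v);    /* first column scaled by v.ps0   */
    return ps_madds1(col1, v, t);    /* add second column times v.ps1  */
}
```

    Two instructions produce both components of the result, with no need to
    duplicate the vector components into separate registers first.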

    PS_SEL      - select

        If (PS0[A] >= 0) then PS0[D] = PS0[C] else PS0[D] = PS0[B]
        If (PS1[A] >= 0) then PS1[D] = PS1[C] else PS1[D] = PS1[B]
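
    PS_SEL allows a per-half choice without branching. A C sketch of the rule,
    plus one possible use (replacing negative halves with zero), is given
    here; ps_reg, ps_make and ps_clamp_zero are names invented for this
    example :

```c
/* Hypothetical model of one FPR in paired-single mode. */
typedef struct { float ps0, ps1; } ps_reg;

static ps_reg ps_make(float ps0, float ps1)
{
    ps_reg r = { ps0, ps1 };
    return r;
}

/* ps_sel : each half of D takes C where A >= 0, otherwise B. */
static ps_reg ps_sel(ps_reg a, ps_reg c, ps_reg b)
{
    ps_reg d;
    d.ps0 = (a.ps0 >= 0.0f) ? c.ps0 : b.ps0;
    d.ps1 = (a.ps1 >= 0.0f) ? c.ps1 : b.ps1;
    return d;
}

/* Example use : keep non-negative halves of x, replace the others with 0. */
static ps_reg ps_clamp_zero(ps_reg x)
{
    return ps_sel(x, x, ps_make(0.0f, 0.0f));
}
```

    In this sketch a NaN in A fails the ">= 0" test, so the B half is taken.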

    PS_SUM0     - vector sum high

        PS0[D] = PS0[A] + PS1[B]
        PS1[D] = PS1[C]

    PS_SUM1     - vector sum low

        PS0[D] = PS0[C]
        PS1[D] = PS0[A] + PS1[B]
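
    A typical use of the sum instructions is a horizontal add, for example to
    finish a dot product. The C sketch below models PS_MUL, PS_SUM0 and
    PS_SUM1 and builds a 2-component dot product from them. The ps_reg type
    and the names ps_make and dot2 are hypothetical, and the argument order
    (a, b, c) is specific to this sketch :

```c
/* Hypothetical model of one FPR in paired-single mode. */
typedef struct { float ps0, ps1; } ps_reg;

static ps_reg ps_make(float ps0, float ps1)
{
    ps_reg r = { ps0, ps1 };
    return r;
}

static ps_reg ps_mul(ps_reg a, ps_reg c)
{
    return ps_make(a.ps0 * c.ps0, a.ps1 * c.ps1);
}

/* ps_sum0 : PS0 gets PS0[A] + PS1[B], PS1 is copied from C. */
static ps_reg ps_sum0(ps_reg a, ps_reg b, ps_reg c)
{
    return ps_make(a.ps0 + b.ps1, c.ps1);
}

/* ps_sum1 : PS0 is copied from C, PS1 gets PS0[A] + PS1[B]. */
static ps_reg ps_sum1(ps_reg a, ps_reg b, ps_reg c)
{
    return ps_make(c.ps0, a.ps0 + b.ps1);
}

/* a.ps0 * b.ps0 + a.ps1 * b.ps1, result taken from PS0. */
static float dot2(ps_reg a, ps_reg b)
{
    ps_reg t = ps_mul(a, b);
    return ps_sum0(t, t, t).ps0;
}
```

    Passing the product register as both A and B adds its two halves
    together, which is the step a purely element-wise instruction set cannot
    do on its own.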


        ... TODO
    Some floating-point instructions change behaviour when PS mode is enabled.

    ---------------------------------------------------------------------------

    Written 2003, 2004 by ORG / Dolwin team. Last updated 28 Mar 2004.
