# ISAs revisited





# **Resources/readings**

Intel® 64 and IA-32 Architectures Software Developer Manuals

Arm A64 Instruction Set Architecture

RISC and CISC comparison paper (1991)

RISC vs CISC power struggles paper (2013)

<u>Agner Fog's instruction tables (for uop analysis)</u>





# Terminology review

ISA (Instruction set architecture): interface between high-level programming language and hardware

Instructions

Registers

Memory models

I/O model

Microarchitecture: hardware implementation of ISA

# Today: how do different ISAs compare? Does choice of ISA limit HW performance?

### Back in the day: accumulator architectures

Single-register architecture (register called "accumulator")

All arithmetic operations have the accumulator as source and destination

e.g. ADD 200 means: add value at mem. address 200 to accumulator and store result in accumulator

Born of necessity (registers were expensive!)

Fun reading: PDP-8 instruction set





## Different approaches to branching

How does new PC get determined?

im<u>aqe source</u>

RISCV jumps (not branches): direct (jal) and indirect (jalr)

How does comparison get computed?

RISCV: comparison and branch in one instruction

x86, arm: compare, then branch (cmp eax 0, jne end)

### Requires status/flag/condition code register

Related: predicated instructions (we'll come back to this for DLP)

| 63           | 32 |
|--------------|----|
| RESØ         |    |
|              | 0  |
| N Z C V RES0 |    |
|              |    |

|                                                                                                                                                                                                                                                                                                                             | 31                                                                                                                                                                                                                                            | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20    | 19  | 18 | 17     | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7  | 6  | 5 | 4  | 3 | 2  | 1 | 0  |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----|----|----|----|----|----|----|----|----|-------|-----|----|--------|----|----|----|----|----|----|----|---|---|----|----|---|----|---|----|---|----|
|                                                                                                                                                                                                                                                                                                                             | 0                                                                                                                                                                                                                                             | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | 0  | D  | V I P | VIF | AC | V<br>M | RF | 0  | NT |    |    | OF | DF | F | F | SF | ZF | 0 | AF | 0 | PF | 1 | CF |
| X Virtual Inter<br>X Virtual Inter<br>X Alignment O<br>X Virtual-8086<br>X Resume Fla<br>X Nested Tas<br>X I/O Privilege<br>S Overflow Fl<br>C Direction Fl<br>X Interrupt En<br>X Trap Flag (T<br>S Sign Flag (Z<br>S Auxiliary Ca<br>S Parity Flag (<br>S Carry Flag (<br>S Indicates a<br>X Indicates a<br>X Indicates a | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |    |    |    |    |    |    |    |    |    |    |       |     |    |        |    |    |    |    |    |    |    |   |   |    |    |   |    |   |    |   |    |

### Figure 3-8. EFLAGS Register

(from Intel manual linked on first slide)

(

# **Different** approaches to memory

Load/store or register-register architectures: all arithmetic operations done in registers (need to load from memory into register first)

RISC-V, MIPS, Arm

**Register-memory architectures**: arithmetic operations can be done using combination of registers and memory addrs

x86

# **Comparing 64-bit ADD instructions**

### **RISCV**

ADD, ADDW (32-bit add), ADDI, ADDIW

### Armv8

Same mnemonic (ADD), different machine instructions

ADD W0, W1, W2 (32-bit) ADD X0, X1, X2 (64-bit) ADD X0, X1, W2, SXTW (sign-extended) ADD X0, X1, #42

Each AArch64 64-bit general-purpose register (X0-X30) also has a 32-bit (W0-W30) form. Figure 4.2. 64-bit register with W and X access.

ımaae

source

### x86-64 ADD

#### ADD-Add

| Opcode           | Instruction                             | Op/<br>En | 64-bit<br>Mode | Compat/<br>Leg Mode | Description                                  |
|------------------|-----------------------------------------|-----------|----------------|---------------------|----------------------------------------------|
| 04 ib            | ADD AL, imm8                            | 1         | Valid          | Valid               | Add imm8 to AL.                              |
| 05 iw            | ADD AX, imm16                           | 1         | Valid          | Valid               | Add imm16 to AX.                             |
| 05 id            | ADD EAX, imm32                          | 1         | Valid          | Valid               | Add imm32 to EAX.                            |
| REX.W + 05 id    | ADD RAX, imm32                          | 1         | Valid          | N.E.                | Add imm32 sign-extended to 64-bits to RAX.   |
| 80 /0 ib         | ADD r/m8, imm8                          | MI        | Valid          | Valid               | Add imm8 to r/m8.                            |
| REX + 80 /0 ib   | ADD r/m8 <sup>*</sup> , imm8            | MI        | Valid          | N.E.                | Add sign-extended imm8 to r/m8.              |
| 81 /0 iw         | ADD r/m16, imm16                        | MI        | Valid          | Valid               | Add imm16 to r/m16.                          |
| 81 /0 id         | ADD r/m32, imm32                        | MI        | Valid          | Valid               | Add imm32 to r/m32.                          |
| REX.W + 81 /0 id | ADD r/m64, imm32                        | MI        | Valid          | N.E.                | Add imm32 sign-extended to 64-bits to r/m64. |
| 83 /0 ib         | ADD r/m16, imm8                         | MI        | Valid          | Valid               | Add sign-extended imm8 to r/m16.             |
| 83 /0 ib         | ADD r/m32, imm8                         | MI        | Valid          | Valid               | Add sign-extended imm8 to r/m32.             |
| REX.W + 83 /0 ib | ADD r/m64, imm8                         | MI        | Valid          | N.E.                | Add sign-extended imm8 to r/m64.             |
| 00 /r            | ADD r/m8, r8                            | MR        | Valid          | Valid               | Add r8 to r/m8.                              |
| REX + 00 /r      | ADD r/m8 <sup>*</sup> , r8 <sup>*</sup> | MR        | Valid          | N.E.                | Add r8 to r/m8.                              |
| 01 /r            | ADD r/m16, r16                          | MR        | Valid          | Valid               | Add r16 to r/m16.                            |
| 01 /r            | ADD r/m32, r32                          | MR        | Valid          | Valid               | Add r32 to r/m32.                            |
| REX.W + 01 /r    | ADD r/m64, r64                          | MR        | Valid          | N.E.                | Add r64 to r/m64.                            |
| 02 /r            | ADD r8, r/m8                            | RM        | Valid          | Valid               | Add r/m8 to r8.                              |
| REX + 02 /r      | ADD r8 <sup>*</sup> , r/m8 <sup>*</sup> | RM        | Valid          | N.E.                | Add r/m8 to r8.                              |
| 03 /r            | ADD r16, r/m16                          | RM        | Valid          | Valid               | Add r/m16 to r16.                            |
| 03 /r            | ADD r32, r/m32                          | RM        | Valid          | Valid               | Add r/m32 to r32.                            |
| REX.W + 03 /r    | ADD r64, r/m64                          | RM        | Valid          | N.E.                | Add r/m64 to r64.                            |

 $\bigcirc$ 

# ???

What would we have to add to our single-stage processor to implement the x86 add variations?



## Classification of ISAs: RISC/CISC

RISC-V, MIPS, Arm are **RISC** (Reduced Instruction Set Computer) architectures

x86 is **CISC** (Complex Instruction Set Computer) architecture

Typically a larger set of instructions

Allows register-memory instrs

Allows variable-length instruction encodings

Allows instructions that take longer than 1 cycle

*Except...* this distinction is becoming less useful

Blem et al. paper summed up: it's up to the microarchitecture



CISC

# What's typically the case?

 $\odot$ 

Larger code size

**Pipelining is harder** 

Single instruction takes more work

Fewer available GPRs

More complicated decoder



# Micro-ops

Translation of complex machine instruction (macro-op) to multiple steps Microarchitecture dependent (Intel doesn't provide documentation on this) For example, addq 8(%rdi) %rax might be translated into:

add 8 to rdi

load that address from memory

add that value to rax

store the result in rax

We will come back to this for out-of-order

**CISC becomes RISC-ier**: control unit has an easier job with each uop; in turn makes pipelining easier

In practice, compilers such as gcc also prioritize the RISC-er instructions (Blem et al paper)

# $\mathbf{O}$

# ????

Why does x86 have so many instructions?

# Extending ISAs

#### Chapter 21

#### Extending RISC-V

| Base      | Version |
|-----------|---------|
| RV32I     | 2.0     |
| RV32E     | 1.9     |
| RV64I     | 2.0     |
| RV128I    | 1.7     |
| Extension | Version |
| M         | 2.0     |
| A         | 2.0     |
| F         | 2.0     |
| D         | 2.0     |
| Q         | 2.0     |
| L         | 0.0     |
| C         | 2.0     |
| B         | 0.0     |
| J         | 0.0     |
| Т         | 0.0     |
| Р         | 0.1     |
| V         | 0.2     |
| N         | 1.1     |

### image source

16 Bit Thumb & 16/32 Bit Thumb2

 $( \bullet )$ 

### **Powerful & Scalable Instruction Set**

| VABS VADD VCHP<br>VLDR VHLA VHLS                                        | Floating Point                                                                      | ADIV VLDH<br>HUL VNEG                                          |
|-------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------|
| VNMLA VHMLS VNMUL<br>VSUB VFMA VFMS                                     | VPOP VPOSH VSQRT                                                                    | Cortex-M4 FPU                                                  |
| PKH QADD F<br>QSUB QSUBIA DS<br>SHADDB SHASX SHLATT SHLAD S             | P (SIMD, fast MAC)                                                                  | QOSUB OSAX<br>SEL SHADD14<br>SHLABT SHLATB<br>SHLALD SHLAWB    |
| SHLAWT SHLSD SHLSLD<br>ADC ADD ADR<br>CLZ BFC BFI<br>CBNZ CBZ CHN CHP   | SHHILA   SHHILS   SHHUL     AND   ASR   B     BIC   COP   CLREX     DBG   EOR   LDC | SHULBY SHULBE<br>SHULET SHULT<br>SHULTE SHULWE<br>SHULWE SHUSD |
|                                                                         | ta processing<br>anipulations                                                       | SSATIA SSAX<br>SSUBIA SSUBA<br>SXTAB SXTABIA<br>SXTAH SXTBIA   |
| ORR PLD PLDW<br>RBIT REV REVIS                                          | PLI POP PUSH<br>REVSH ROR RRX<br>RSB SBC SBFX                                       | UADDI6 UADDI<br>UASX UHADDI6                                   |
| BKPT BLX ADC ADD ADR<br>BX CPS AND ASR B<br>DHB BL BL                   | SDIV SEV SHLAL<br>SHULL SSAT STC<br>STHIA STHDB STR                                 | UHADDE UHASX<br>UHSAX UHSUBI6<br>UHSUBE UHAAL                  |
| General data processing                                                 | STRB STRBT STRD<br>STREX STREXB STREXH<br>STRH STRHT STRT                           | UQADD16 UQADD8<br>UQASX UQSAX<br>UQSUB16 UQSUB8                |
| I/O control tasks                                                       | SUB SXTB SXTH<br>TBB TBH TEQ<br>TST UBFX UDIV                                       | USADI USADAI<br>USATIA USAX<br>USUBIA USUBI                    |
| SEV SXTB ASB SBC STH<br>SXTH UXTB STR STRB STRH<br>UXTH WVE SUB SVC TST | UMLAL UMULL USAT                                                                    | UXTAB UXTABIS                                                  |

# ???

Why might a variable-length instruction encoding be useful?

### **RISC becomes CISC-ier!**

## **RISC-V C extension**

### Sometimes used for embedded applications

 $\rm RVC$  uses a simple compression scheme that offers shorter 16-bit versions of common 32-bit RISC-V instructions when:

- the immediate or address offset is small, or
- one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer (x2), or
- the destination register and the first source register are identical, or
- the registers used are the 8 most popular ones.

The C extension is compatible with all other standard instruction extensions. The C extension allows 16-bit instructions to be freely intermixed with 32-bit instructions, with the latter now able to start on any 16-bit boundary. With the addition of the C extension, JAL and JALR instructions will no longer raise an instruction misaligned exception.





# ???

# Why were new ISAs after about 1982 mostly RISC?

 $\mathbf{O}$ 

 $\odot$ 

### Arm vs. Intel: market incentives

<u>Arm claims</u> 99% of premium smartphones use their chips

Prioritizes low energy over performance

Licenses chips + gets royalties instead of manufacturing chips

Apple's switch from x86 to Arm driven by need for flexibility

Already were making Arm-based chips for iPhones and iPads

Don't need to rely on manufacturing issues (deadlines, quality control) of other company

Intel used to power through based on market dominance

IBM's adoption of Intel in the 80s fueled rise

Intel in turn was able to spend more \$\$ on R&D

### Concluding remarks from Blem et al

While our study shows that RISC and CISC ISA traits are irrelevant to power and performance characteristics of modern cores, ISAs continue to evolve to better support exposing workload-specific semantic information to the execution substrate. On x86, such changes include the transition to Intel64 (larger word sizes, optimized calling conventions and shared code support), wider vector extensions like AVX, integer crypto and security extensions (NX), hardware virtualization extensions and, more recently, architectural support for transactions in the form of HLE. Similarly, the ARM ISA has introduced shorter fixed length instructions for low power targets (Thumb), vector extensions (NEON), DSP and bytecode execution extensions (Jazelle DBX), Trustzone security, and hardware virtualization support. Thus, while ISA evolution has been continuous, it has focused on enabling specialization and has been largely agnostic of RISC or CISC. Other examples from recent research include extensions to allow the hardware to balance accuracy and reliability with energy efficiency [15, 13] and extensions to use specialized hardware for energy efficiency [18].