As mentioned at the bottom of the previous post, I’ve been designing a new 32 bit processor, and this post is concerned with the design of that processor’s Instruction Set Architecture.
The objectives for this project are as follows:
- To learn as much as possible from the experience
- To leverage what I learned making my previous 16 bit processor
- To produce an ISA which is pleasant to write assembly code for by hand
- There are limits to this: I’m not interested in designing a super-CISC processor, even though the more CISC a processor is the easier it is, generally, to program in assembly
- But this is why this ISA has CISC characteristics, including things like a hardware stack
- To learn more about the LLVM compiler suite and attempt to write a target for my ISA
- Longer term, I want to explore what would be involved in adding a pipeline to this design, which might well involve scrapping this ISA and producing a new one
This design is strongly influenced by my previous 16 bit ISA project, and indeed this project could be described by looking at how it is different to my previous effort, though I will try to minimise references to my earlier project to make this as self-contained as possible. Some familiarity with my previous project would help to make sense of this post.
The processor’s current capabilities are as follows:
- 16 x 32 bit General Purpose Registers
- 32 bit Program Counter
- External busses are 30 bits for the address and 32 bits for data
- Internally addresses are byte granular; the external address bus is 30 bits wide with 4 byte select outputs
- The external world must present a 32 bit wide databus
- Longword, Word and Byte access to 32 bit wide external memory/IO
- Sign or Zero extension for reads
- Unaligned accesses are not permitted
- The processor is Big Endian, just because I prefer it
- Carry, Zero, Negative and Overflow ALU flags
- The ALU operates on 32 bit quantities
- 32 bit instruction words have sufficient room for an embedded sign extended quantity for: immediate loads, ALU operations, displacements used by both loads/stores, and branching
- All the usual ALU operations are available, the most sophisticated being 16 bit to 32 bit signed and unsigned multiply
- The ALU uses a destination register with one or two (depending on operation) operand register(s)
- All control flow operations are conditional, including return
- A modified version of ARM’s 4 bit conditional encoding is used
- Jump and subroutine call through an address held in a register
- CustomASM is the currently used assembler
- The assembly syntax I’ve chosen has the destination on the left
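The memory interface items in the list above (byte addresses internally, a 30 bit external address bus with 4 byte selects, big endian ordering, no unaligned accesses) can be sketched in a few lines. This is a Python model rather than the VHDL in the repo, and the function name and the ordering of the selects (most significant byte lane first) are assumptions made for illustration.

```python
def bus_access(address: int, size: int) -> tuple:
    """Split a 32 bit byte address into a 30 bit bus address and 4 byte selects.

    size is 1 (byte), 2 (word) or 4 (longword). The selects are listed from
    the most significant byte lane (data bits 31..24) to the least; that
    ordering is an assumption of this sketch.
    """
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    if address % size != 0:
        raise ValueError("unaligned accesses are not permitted")
    offset = address & 0b11  # byte position within the longword
    # Big endian: offset 0 is the most significant byte lane.
    selects = tuple(offset <= lane < offset + size for lane in range(4))
    return (address >> 2) & 0x3FFFFFFF, selects
```

A byte read at address 0x1001 would therefore drive bus address 0x400 with only the second lane selected, while a word access at the same address is rejected as unaligned.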
As I write this blog the processor is running on my DE2-115 FPGA development board, and in simulation using GHDL, and I’m considering my next steps with the project. This blog is not an instruction manual for my processor. The best place to obtain detailed information is the GitHub repo, and the documentation, such as it currently is, there. But this blog will serve as a way to gather my thoughts on the project.
This post is concerned only with the design of the Instruction Set. Future posts will discuss the implementation of the processor, software, etc.
A critical part in any ISA design is the formatting of the instruction words. As usual it’s about trade-offs, usually efficiency vs functionality. My processor uses a 32 bit instruction word, to make it easy to implement. While many 32 bit processors have 16 bit instruction words (the MC68K is one), this makes the parsing of the instructions more complex, as many instructions end up needing multiple instruction words. On the other hand, a 32 bit instruction word can be wasteful, as much of the instruction word, especially for the simpler instructions, can go unused. A case in point is the NOP instruction, which needs only a small portion of the 32 bits.
In the description of this ISA the term “quick” is used to describe an immediate value embedded in the instruction word. Often there are two alternative instructions which do the same thing: one which takes a full 32 bit immediate in a following longword, and a “quick” version. For instance, loading the value 0x42 into a register could be done either with a two longword sequence or, because the value fits in 16 bits, with a single longword using the quick load variant. Note that some quick values are 16 bit, and some are 12 bit, depending on the opcode in question.
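Whether a given 32 bit value can use a quick variant comes down to whether it survives sign extension from the field width. A minimal Python sketch of that check follows; the function names are made up for illustration and are not taken from the assembler.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign extend the low `bits` bits of value to a 32 bit quantity."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):  # top bit of the field is set
        value -= 1 << bits
    return value & 0xFFFFFFFF


def fits_quick(value: int, bits: int) -> bool:
    """True if a 32 bit value round trips through a `bits` wide quick field."""
    value &= 0xFFFFFFFF
    return sign_extend(value, bits) == value
```

With this check 0x42 fits a 16 bit quick field, but 0x8000 does not, because sign extension would turn it into 0xFFFF8000.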
This processor uses seven different “formats” of instruction word. Each format code number is the first nybble of the 8 bit opcode, resulting in some gaps in the opcode space. The formats have no bearing on the internal structure of the processor, but are a way to get the various fields in the instructions into a recurring pattern when laying out the instruction words, which greatly improves efficiency and simplifies the coding.
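As a rough illustration of how the format number falls out of the opcode, here is a Python sketch. It assumes that “first nybble” means the most significant nybble of the 8 bit opcode; the opcode values used in the usage note are invented, so see the repo for the real ones.

```python
# Format names as used in the headings of this post.
FORMAT_NAMES = {
    0: "Base",
    1: "Load Immediate Long and Load Word Quick",
    2: "Other Loads and Stores",
    3: "Flow Control",
    4: "ALU Operations",
    5: "ALU Quick operations",
    6: "Push and Pop including Multiple",
}


def format_of(opcode: int) -> int:
    """Return the format number, assumed to be the high nybble of the opcode."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    return opcode >> 4
```

A hypothetical opcode of 0x31 would land in format 3 (Flow Control), and any opcode from 0x70 upwards falls into one of the gaps in the opcode space.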
Format 0: Base – NOP, HALT, ORFLAGS, ANDFLAGS
QUICK FLAGS holds the flags to and/or into the current condition codes. Only the low 4 bits are used.
The condition codes cannot currently be transferred to a General Purpose register, nor can they be stacked. This is something that will need to be corrected when support for interrupts is added.
Format 1: Load Immediate Long and Load Word Quick – LOADLI, LOADWSQ
For LOADWSQ (load, word, sign extended, quick) the 16 bit quick value is sign extended to 32 bits.
Format 2: Other Loads and Stores – LOADR, STORER, LOADM, STOREM, LOADRD, STORERD, LOADPCD, STOREPCD, LOADRDQ, STORERDQ, LOADPCDQ, STOREPCDQ
This covers all the other loads and stores. Generally this is a matrix of operations with the following characteristics:
- Load vs Store
- Quick vs immediate displacement
- PC vs a General Purpose Register as the base address
This also covers the load and store through an immediate memory address (LOADM and STOREM). It is also possible to load through a register without a displacement (LOADR). This is functionally the same as LOADRD or LOADRDQ with a displacement of 0, but using this opcode saves a cycle. This is mirrored in the store opcode equivalents.
One instruction type I’m considering is a load/store with the displacement held in a register. That way you could do instructions like …
load.l r0,(r1,r2)
… which would calculate r1 + r2 and load the long at that address into r0. I’m not currently sure how well it would fit in with the existing instruction formats however.
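The address calculation for such an instruction might look like the following Python sketch. Since the instruction does not exist yet, everything here is hypothetical, including the assumption that the sum wraps at 32 bits and that the usual alignment rule applies.

```python
def indexed_address(base: int, index: int) -> int:
    """Effective address for a hypothetical load.l r0,(r1,r2)."""
    address = (base + index) & 0xFFFFFFFF  # assumed to wrap at 32 bits
    if address % 4 != 0:
        raise ValueError("longword access must be longword aligned")
    return address
```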
Format 3: Flow Control – JUMP, BRANCH, BRANCHQ, JUMPR, CALLJUMP, CALLBRANCH, CALLBRANCHQ, CALLJUMPR, RETURN
The flow control operations are pretty exhaustive. All flow control operations are conditional, including return. The condition to test is encoded using a modified form of ARM’s 4 bit condition code, and covers things such as whether the zero flag is set, along with useful comparison tests for signed and unsigned quantities. The full set is currently documented in the project README.md. Without any loss in functionality, this saves 4 bits in the instruction word relative to my 16 bit processor design, which used an 8 bit wide field for its conditions: 4 bits for a mask and 4 bits for the required values of the masked bits.
The 4 bit condition codes do not quite match ARM’s because I wanted “always” to be represented by the condition code 0, whereas ARM uses 0xe for this. Also, ARM uses a reversed carry flag for subtractions, whereas this processor does not.
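For comparison, the older mask-and-value scheme from the 16 bit design can be sketched as below. This is a Python model only; the flag bit positions are an assumption made for illustration, and the sketch does not attempt to reproduce the new 4 bit encodings, which are listed in the README.md.

```python
# Assumed bit positions for the four ALU flags (illustrative only).
CARRY, ZERO, NEGATIVE, OVERFLOW = 0b0001, 0b0010, 0b0100, 0b1000


def mask_value_condition(flags: int, mask: int, value: int) -> bool:
    """Old style 8 bit condition: masked flag bits must equal the value bits."""
    return (flags & mask) == (value & mask)
```

Under this scheme “branch if zero set” is mask=ZERO, value=ZERO, and an all-zero mask gives an unconditional branch, because no flag bits are examined.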
It is possible, via JUMPR and CALLJUMPR, to jump and jump to a subroutine through an address held in a register, which was not implemented on my 16 bit processor. This was added to allow the programmer to make use of jump tables, which couldn’t be done before without tricks involving self-modifying code.
Register 15 is hardcoded by the assembler to be the stack pointer for all call and return operations, though this is only a programmer convenience; the implementation itself would allow the programmer to use other registers as stack pointers for subroutine calls and returns.
Two quick variants are provided, one for branching and one for calls to a subroutine through a branch. In both cases a 12 bit (byte) offset sign extended to 32 bits must be supplied.
Since instructions are always longword aligned I could multiply the allowed branch displacement by 4 by shifting the quick displacement up two bit positions. The maximum displacement would increase from +/- 2KB to +/- 8KB.
One instruction missing which I may implement is Jump And Link. This would act like a jump, but store the previous Program Counter in a register. This would allow calls to short subroutines which could be entered and then returned from (via the jump through a register operation) without any stacking overhead.
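The arithmetic behind those figures is easy to check. The Python sketch below uses a made-up helper name and simply computes the reach of a signed field before and after shifting it up.

```python
def displacement_range(bits: int, shift: int = 0) -> tuple:
    """Lowest and highest byte displacement of a signed field shifted left."""
    lowest = -(1 << (bits - 1)) << shift
    highest = ((1 << (bits - 1)) - 1) << shift
    return lowest, highest


print(displacement_range(12))     # (-2048, 2047)
print(displacement_range(12, 2))  # (-8192, 8188)
```

The shifted form can only express multiples of 4, which costs nothing here because every instruction is longword aligned.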
Format 4: ALU Operations – ALUM, ALUMI, ALUS
The ALU OP CODE contained in the instruction is 4 bits wide, though in the implementation the ALU uses a 5 bit code which also includes the OPCODE’s LSB in its MSB. This MSB switches the ALU between multiple operands (eg. the add operation) and the single operand operation (eg. the bitwise NOT operation). This complication is necessary to minimise some of the logic in other instructions.
The numbering of the operands, 2 and 3, is strange because of how the registers are decoded from fixed positions within the instruction word: register 1 is usually the destination register, register 2 is either an ALU operand register or it is used for addressing, and register 3 is the optional one.
Unlike the 16 bit processor, it is possible to execute instructions like the following:
add r0,r1,r2
This will add r1 to r2 and put the result in r0, leaving r1 and r2 unchanged. Taking advantage of this functionality is optional however; the assembler will do the right thing if given:
add r0,r1
Namely, it will add r1 to r0, putting the result in r0.
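The expansion the assembler performs can be pictured with a small Python sketch. This is not how CustomASM rules are written, and the operand order chosen for the expanded form (destination repeated as the first operand) is an assumption; it only matters for non-commutative operations.

```python
def expand_alu_operands(operands: list) -> list:
    """Expand the two register shorthand into destination plus two operands."""
    if len(operands) == 3:
        return list(operands)
    if len(operands) == 2:
        dest, src = operands
        # add r0,r1 is treated as add r0,r0,r1 (assumed operand order)
        return [dest, dest, src]
    raise ValueError("expected two or three register operands")
```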
Format 5: ALU Quick operations – ALUMQ
This format is used by just one operation, ALUMQ: ALU operations with Multiple operands, Quick. This is a shorter version, in terms of instruction words, of the non-quick ALUMI operation and allows assembly code like the following:
addq r0,r1,#0x123
This will add 0x123 to r1, putting the result in r0. All done in a single instruction longword. Currently the programmer is responsible for selecting the quick instruction variants, using the q suffix on the applicable instructions.
Note that all of the multiple (two) operand ALU operations are available with the quick variant, even though some of them are less useful, eg. the logical OR operation. Since the value is sign extended from 12 bits to 32 bits, the upper 20 bits will be either all 0s or all 1s, depending on bit 11 of the input.
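To see why a quick logical OR is of limited use, here is a Python sketch of the sign extension followed by the OR. The function names are invented for illustration and this is a model of the behaviour described above, not the ALU implementation.

```python
def sign_extend_12(quick: int) -> int:
    """Sign extend a 12 bit quick value to 32 bits."""
    quick &= 0xFFF
    return (quick | 0xFFFFF000) if quick & 0x800 else quick


def or_quick(register_value: int, quick: int) -> int:
    """Model of a logical OR between a register and a 12 bit quick value."""
    return (register_value | sign_extend_12(quick)) & 0xFFFFFFFF
```

ORing 0x00000001 with a quick value of 0x7FF gives 0x000007FF, but a quick value of 0x800 gives 0xFFFFF801, since setting bit 11 also sets the whole upper 20 bits.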
Format 6: Push and Pop including Multiple – PUSH, POP, PUSHMULTI, POPMULTI
There are two forms of register stacking operations available: single and “multi”. The single type will stack or unstack one register at a time, adjusting the selected stack pointer accordingly. These operations use the DATA REG field.
The multi variants are perhaps more interesting. When they are used the MULTI REGISTER MASK field selects which registers should be stacked or unstacked. Bit 0 in this field is used for r0, bit 1 for r1 and so on. Registers are unstacked in the reverse of the order in which they were stacked, as expected. This allows many registers to be stacked or unstacked without issuing many push or pop instructions. However the multi variants should not be used routinely: if only one or two registers are being stacked, the single register operations are faster. The multi versions are only more efficient, clock count wise, when more than 2 registers are stacked.
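The mask handling can be modelled in a few lines of Python. This is a sketch under assumptions: a Python list stands in for stack memory, and pushing in ascending register order is a guess, since the only property stated above is that unstacking reverses the stacking order.

```python
def registers_in_mask(mask: int) -> list:
    """Register numbers selected by the mask: bit 0 is r0, bit 1 is r1, etc."""
    return [n for n in range(16) if mask & (1 << n)]


def push_multi(stack: list, registers: list, mask: int) -> None:
    """Push the selected registers (ascending order is an assumption)."""
    for n in registers_in_mask(mask):
        stack.append(registers[n])


def pop_multi(stack: list, registers: list, mask: int) -> None:
    """Pop the selected registers in the reverse of the push order."""
    for n in reversed(registers_in_mask(mask)):
        registers[n] = stack.pop()
```

Pushing with mask 0b1011 saves r0, r1 and r3; popping with the same mask restores all three to their original values.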
The code, in VHDL, is all up on GitHub for anyone to take a look at, comment on etc.
The next post will go over the rationale behind this ISA design. It’ll also discuss some of the details of the hardware implementation, and it will speculate on what’s in store for this processor design in the future…