The design of a 32 bit pipelined processor core in Verilog

The initial target for my iCE40UP devboard was and is a softcore processor design.

I’ve worked on two softcores in the past:

  1. A 16 bit core implemented in VHDL: this core features 8 by 16 bit registers, a 16 bit external design (that is, external 16 bit address and data busses), and a fairly complete set of ALU operations. It’s a mixed RISC and CISC design in that it has a fully orthogonal instruction set like a RISC design, but also the classic CISC trait of explicit stacking and program-flow instructions.
  2. A 32 bit extension of the above: Also in VHDL, this had a few enhancements beyond the wider data widths throughout, but was conceptually very similar to the 16 bit design that preceded it.

The first design was fairly comprehensively explored; I even implemented a working bit of software for it – a Snake game – and wrapped the processor in other FPGA hardware, including a VGA display. The 32 bit core, however, died on the vine. I got as far as running some trivial sample programs under simulation, but I lost interest in the design. I did, though, learn a few things from it.

Implementation-wise, both these designs had a very simple and inefficient Control Unit: multiple clocks, between 2 and 5, were required to execute each instruction, the so-called multicycle microarchitecture approach.

With the new development board up and running, I decided it was time to resurrect my softcore projects.

The high-level goals for the design, which I decided to actually give a name this time, MaxiCore32, were as follows:

  • 32 bit throughout.
  • 16 registers, giving a 4 bit wide selector in the 32 bit instruction word; while 8 registers would have been enough, 16 gives more “leg room” and makes coding more pleasant.
  • Pipelined: in a big step up in complexity, I decided to try implementing a pipelined Control Unit.
  • If in doubt, follow RISC principles at the expense of making the ISA harder to program.
    • This would ease the creation of the pipeline.
  • All the usual ALU functionality from the previous 32 bit design, but with things added as and when they proved useful for real code.
  • Implemented in Verilog.

If it wasn’t for the pipeline aspects, this project would not really be worthy of a blog post. But as it happens I’m extremely pleased to have got to the point of running real code on a pipelined processor that follows the above goals.

A pipeline allows the processor to start an instruction on every clock period, though multiple clocks are still required to complete an instruction. This obviously complicates the design considerably since an instruction is no longer a sequence of lower-level operations executed to completion before starting the next instruction. Instead, parts of an instruction are executed in parallel with subsequent instructions; thus a read from a register to set up an ALU operation might happen at exactly the same time as a write into a register. In my design there are only two stages in the pipeline: a setup stage and a completion (write-back) stage.
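
As a rough sketch of what this two stage arrangement can look like in Verilog (all of the names below are made up for illustration and are not taken from the actual MaxiCore32 source), the first stage produces a result plus a note of what should be done with it, and the second stage performs that write-back on the following clock while the first stage is already working on the next instruction:

// Two stage pipeline skeleton, illustrative only: names are invented.
// Stage 1 has decoded the instruction, read the Register File and run the
// ALU; its outcome is latched into stage 2 registers and written back one
// clock later, overlapping with stage 1's work on the next instruction.
reg [3:0]  s2_dest;        // destination register index carried to stage 2
reg [31:0] s2_result;      // value produced by stage 1
reg        s2_write_en;    // does this instruction write a register?
reg [31:0] registers [0:15];

always @(posedge clk) begin
    // stage 1 -> stage 2 handover
    s2_dest     <= s1_dest;
    s2_result   <= s1_result;
    s2_write_en <= s1_write_en;

    // stage 2: complete (write back) the previous instruction, using the
    // values latched on the previous clock
    if (s2_write_en)
        registers[s2_dest] <= s2_result;
end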

With a pipeline multiple parts of the processor can be kept busy at the same time, instead of being idle waiting for the next instruction which needs to make use of them. This increases instruction throughput without requiring the clock rate to be increased.

The instruction pipelining aspects will be discussed later in this post when looking at some example code, but since processor pipelining is such a complex topic, one I could not possibly do justice to here, I suggest checking out some of the many videos on this fascinating area.

In previous posts I’ve discussed my softcore designs in excruciating detail, covering instruction formatting and other minutiae. This post will instead go over the high-level structure of this design and describe how the processor is used in a bigger system.

First, a complete list of the features currently implemented by this softcore design. The ISA and the microarchitecture are considered together here:

  • 32 bit address and data busses.
    • Byte addressable memory.
  • 16 by 32 bit registers; all registers are equal and each can be used for addresses or data.
  • Two stage pipeline; NOPs must be manually inserted in code to avoid data hazards.
    • However, control hazards are catered for in the microarchitecture by the automatic insertion of NOPs into the pipeline after a branch or jump.
  • No multiple word instructions are used; all instructions are rolled into one 32 bit word.
  • Load 16 bit immediate: signed, unsigned, or top or bottom half only.
    • Thus a 32 bit immediate load would require two instructions, however these are surprisingly rare in practice.
  • Memory loads can be sign or zero extended from bytes or words; memory stores can either be to longs, words or bytes. Unaligned memory operations are not allowed.
  • Load or store to or from an address held in a register with an embedded 16 bit sign extended displacement.
  • Same as above but with a displacement obtained from a register.
  • No load or store to or from an immediate memory address.
    • Again, this is rarely needed.
  • No stacking opcodes.
    • This can be worked around easily by advancing the programmer-nominated stack pointer with the ALU and then accessing memory using embedded offsets from that register.
  • The ALU always operates on 32 bit quantities.
    • All the usual ALU operations with either a register or a 15 bit immediate as the second operand.
    • Some “fringe case” ALU operations including byte (8 bit) rotates in both directions to make up for the lack of a barrel shifter.
    • Only ALU opcodes update condition bits.
  • Subroutines facilitated by a Jump-And-Link style instruction which loads a register with the old PC before the PC is adjusted.
    • As a consequence, nested subroutine calls need either manual return address stacking or multiple registers set aside to hold the intermediate return addresses.
  • Branching is via a 16 bit sign extended displacement.
  • Return (and jump tables) implemented with jump-to-register operation.
  • All control flow instructions are conditional, using a 4 bit condition code borrowed from ARM.
  • No interrupts (yet).
  • Assembler implementation uses CustomASM with similar opcode names to the other two cores.

All things considered, the above results in a fairly pleasant ISA for the assembly programmer. The biggest thing lacking is explicit stacking operations, though some of this could be compensated for through the use of macros.

Here’s a sample routine showing some of the characteristics of the design:

; return the key pressed in r0, or 0. return with flags set according to test r0
readkeybd:  load.bu r0,PS2_STATUS_OF(r11)      ; get status of ps/2 port
            nop
            bit r0,r0,0x80                     ; top bit is data ready
            nop
            branch.eq .nokey                   ; no data, then out
            load.bu r0,PS2_SCANCODE_OF(r11)    ; get current scancode
            load.wu r8,last_key-vars(r13)      ; retrieve the last scancode we got
            store.w last_key-vars(r13),r0      ; now overwrite it with current one
            compare r8,r8,KEY_BREAK            ; see if LAST scancode was a break (0xf0)
            nop
            branch.eq .nokey                   ; if it was, we ignore what we just got
            compare r0,r0,KEY_BREAK            ; now see if THIS key is a break
            nop
            branch.eq .nokey                   ; if it was, then ignore that too
            test r0,r0                         ; exit with flags of key state
            jump r14
.nokey:     loadi.u r0,0                       ; return 0 unless we got a real key down
            jump r14

The NOPs are there to compensate for the data hazards which the design does not have hardware to avoid. This is termed a delay slot: since writing the result of a load into a register is not completed until the end of the second pipeline stage, by which point the next instruction is already running in the first pipeline stage, the ALU operation (the bit test, in the first example from the code above) has to be delayed by one cycle with a NOP. Likewise, after an ALU operation the result isn’t written into the status register bits until the second stage completes, which is why a further delay slot is needed before the conditional branch (the fifth instruction in the routine above).

The requirement to fill delay slots with NOP instructions certainly makes assembly programming more challenging, since with careful thought the programmer can minimise them. And minimising them is worthwhile, since these are wasted cycles.

Here’s an example of some code which was structured to minimise delay slot NOPs:

; playerdead - oh dear, sound tone, reduce the lives, update the status display and halt if lives is zero now
playerdead: loadi.u r1,0x8000
            loadi.u r2,0x4000
            store.l TONEGEN_PERIOD_OF(r11),r1
            load.wu r1,lives_left-vars(r13)        ; load lives left
            store.l TONEGEN_DURATION_OF(r11),r2    ; sound death tone
            mulu r2,r1,4                           ; turn it into a long address in r2
            sub r1,r1,1                            ; reduce lives by one (r1)
            ....

Here we need to write to two hardware registers: one sets the tone and the other sets the duration of the note to play on the sounder. Ordinarily this would be coded up as two load/store pairs using a single register, but by using two registers it’s possible to avoid a delay slot after the first load. Further to this, the multiply of r1 by 4 is preceded not by a NOP but by a store, which fills the delay slot with useful work.

The above is an example of manual instruction reordering to avoid wasted instruction cycles. Modern processors do this operation in hardware, but in my softcore it is down to the programmer (me) to perform this function, if I want to optimise my code for execution speed.

There are other types of hazard which are dealt with in hardware; one of these is the control hazard. Ordinarily a branch would involve writing into the Program Counter during the second stage, which would mean that the next instruction loaded into the first pipeline stage would be the one immediately following the branch in memory, and not the one at the branch target. Every single branch or jump would therefore need to have a NOP following it. Since it was fairly easy to implement, this is worked around in hardware: if the first stage is executing a branch or jump instruction, a “fake” NOP is inserted into the pipeline.
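
Something like the following captures the idea; the signal names here are invented for illustration rather than lifted from the real source. When the instruction held in the first stage is a branch or jump, the word fetched in its shadow is discarded and a NOP is clocked into the stage instead:

// Branch bubble sketch, names illustrative only.
localparam [31:0] NOP = 32'h0000_0000;   // assumed NOP encoding

reg [31:0] stage1_instruction;

always @(posedge clk) begin
    if (stage1_is_branch_or_jump)
        stage1_instruction <= NOP;           // automatic delay slot
    else
        stage1_instruction <= fetched_word;  // normal fetch
end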

This automatic delay slot insertion came about because one variant of it is pretty much mandatory: if a memory operation is being executed, the next instruction fetch must be delayed because the memory bus can’t be used to retrieve the instruction while it is being used to read or write data, an example of the von Neumann bottleneck.

In fact there is one more instruction that benefits from automatic delay slot insertion: the HALT instruction. Halting a pipelined processor is more involved than halting a non-pipelined one because other instructions will be partially executed in the various pipeline stages. My solution to that problem is to insert NOPs after the HALT instruction and run the processor for a few more clocks after the HALT is detected by the first pipeline stage.
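
A sketch of how this can be arranged is below; the names and the exact number of drain cycles are assumptions rather than the real implementation. Once the first stage decodes HALT, only NOPs are fed in behind it and a small counter lets the core run on for a couple of clocks so that whatever is already in the second stage can complete before everything stops:

// HALT drain sketch, illustrative names and counts.
reg [1:0] drain;    // clocks left before the core actually stops
reg       halted;

always @(posedge clk) begin
    if (reset) begin
        drain  <= 2'd0;
        halted <= 1'b0;
    end else if (stage1_is_halt && !halted && drain == 2'd0) begin
        drain <= 2'd2;                 // assumed: two clocks to drain
    end else if (drain != 2'd0) begin
        drain <= drain - 2'd1;
        if (drain == 2'd1)
            halted <= 1'b1;            // pipeline is now empty
    end
end

// While draining, and once halted, only NOPs enter the pipeline.
assign force_nop = stage1_is_halt || (drain != 2'd0) || halted;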

When running on real FPGA hardware, the biggest problem with the use of NOPs in program code to avoid data hazards is not execution speed but code density. In the current “application”, discussed below, a program of around 500 instructions contains roughly 100 NOPs, yielding around 20% wastage. Not awful, but not great. Even so, the processor is still significantly faster than my earlier non-pipelined designs.

There are solutions, of varying complexity, that mitigate data hazards by automatically inserting the equivalent of NOPs in hardware instead of burdening the programmer (or compiler writer) with this task. For instance, the classical method is to detect data hazards by determining the data dependencies between instructions.

For example, a load into r1 followed by an ALU operation that needs the value in r1 would require a stall, but if the second instruction only needed the value in r2 no stall would be required. Detecting this requires hardware that compares the register operands used by adjacent instructions in the pipeline.
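
A minimal sketch of such an interlock, assuming each stage exposes its destination and source register indices (none of these signal names come from the real design), compares the register about to be written by stage 2 against the registers stage 1 wants to read, and stalls for one cycle on a match:

// Hazard detection sketch, illustrative names only.
wire hazard = stage2_write_en &&
              ((stage2_dest == stage1_src_a) ||
               (stage2_dest == stage1_src_b));

assign hold_pc       = hazard;   // don't advance the fetch address
assign insert_bubble = hazard;   // clock a NOP into stage 1 instead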

Another approach, which removes the NOP bubble altogether, is to employ operand forwarding. Using this technique the Control Unit determines that the result of the current instruction, such as an ALU operation, should be presented directly to the next instruction, such as a memory store, instead of that instruction obtaining it from the Register File, effectively short-circuiting the register write.
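
In outline, and again purely as a hedged sketch with made-up names, forwarding amounts to a multiplexer in front of each operand that prefers the not-yet-written result from stage 2 whenever the register indices match:

// Operand forwarding sketch, illustrative names only: if stage 2 is about to
// write the register stage 1 wants to read, use stage 2's result directly
// rather than the stale Register File output.
assign operand_a = (stage2_write_en && (stage2_dest == stage1_src_a))
                   ? stage2_result      // forwarded value
                   : regfile_read_a;    // normal register read

assign operand_b = (stage2_write_en && (stage2_dest == stage1_src_b))
                   ? stage2_result
                   : regfile_read_b;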

I have not yet looked seriously at implementing either of these solutions. Both look, on the surface, quite involved. But hopefully I’m up to the challenge as I intend to tackle this problem in the future.

Aside from the pipeline, the structure of the processor and its major components is fairly comparable to my previous designs; indeed some of the modules, like the Register File, are pretty much indistinguishable. If you are interested in them I suggest you have a look at my earlier blog posts on the 16 and 32 bit cores.

Here’s an RTL diagram of MaxiCore32:

The two stages of the pipeline are the memorystage1 and registerstage2 modules.

This RTL diagram was generated from Altera/Intel Quartus; some logic present at the top level of the processor was deleted to simplify the diagram.

One of the things I was keen to do with this softcore was to see the design running in an FPGA at the earliest possible point. I’m pretty pleased with my solution: my own version of the game Boulderdash can run on the core in an FPGA:

The Verilog for this core, and surrounding modules, is up on GitHub for anyone interested. Note that the documentation is pretty shocking at present. Hopefully soon I will produce a decent manual for the design.

Along with the processor the following components have been implemented:

  • Tiled VGA display (Verilog code): each tile is 16 x 16 pixels and 16 colours. Because I’m not an artist, I borrowed the tile images from the NES version of the game.
  • I2C master (Verilog code): This is a “port” of my VHDL project which implements an I2C master, as used by MAXI030.
  • PS/2 controller (Verilog code): This is another port of code previously written in VHDL that implements a PS/2 controller. Unlike the MAXI030 version, this project does not require output of bytes on the PS/2 port, since it only needs to support keyboards and not mice.
  • Tone generator (Verilog code): A simple tone generator for the buzzer. Pitch and duration can be set. Like the other controllers, the user must poll for the end of playback; there is currently no interrupt.

Here’s a simplified view of the board-level top module:


The VGA portion is quite interesting as it encapsulates not just the VGA sync generators but a total of four memories instantiated in block RAM:

  1. Map: A map memory describing what tile to show at each position in the map.
  2. Status: Another tile memory holding what to show in the status line which appears at the top row of the display.
  3. Tiles: The pixel data for each tile. 64 tile slots are available though some are unused.
  4. Palette: The palette table. 16 colours are defined.

The map memory is the most expensive in terms of FPGA resources since it has to be duplicated. The processor needs both read and write access to it (to determine if collisions have occurred), but the iCE40 supports only pseudo-dual-port block RAM, so two copies of the memory are held and both are updated on each processor write.
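
The duplication looks something like the sketch below; the module name, signal names and memory sizes are my illustrative assumptions rather than the real code. Both copies see every processor write, and each copy then provides a read port to a different side:

// Duplicated map memory sketch: each copy is a pseudo-dual-port block RAM
// with one write port and one read port. Names and sizes are assumed.
module map_ram_pair (
    input  wire       clk,
    input  wire       cpu_write_en,
    input  wire [9:0] cpu_addr,        // 32 x 32 map of tile codes
    input  wire [7:0] cpu_write_data,
    input  wire [9:0] disp_addr,
    output reg  [7:0] cpu_read_data,
    output reg  [7:0] disp_read_data
);
    reg [7:0] map_cpu  [0:1023];       // copy read by the processor
    reg [7:0] map_disp [0:1023];       // copy read by the display generator

    always @(posedge clk) begin
        if (cpu_write_en) begin
            map_cpu[cpu_addr]  <= cpu_write_data;   // mirror every write
            map_disp[cpu_addr] <= cpu_write_data;
        end
        cpu_read_data  <= map_cpu[cpu_addr];
        disp_read_data <= map_disp[disp_addr];
    end
endmodule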

One further feature of the display hardware is that the view is scrollable. The map is 32 by 32 tiles, and each 16 pixel square tile is pixel-doubled to 32 by 32 pixels on screen, which yields a view, at 640×480, of 20 by 15 tiles. Two registers are exposed to scroll the view, with the game code updating this position to keep the player as close to the middle of the screen as possible.
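
For what it’s worth, the scroll offset amounts to very little hardware. The sketch below uses assumed names and assumes the window simply wraps within the map:

// Scroll offset sketch, names and wrap-around behaviour assumed.
wire [4:0] map_col  = scroll_x + view_col;   // view_col: 0..19, wraps mod 32
wire [4:0] map_row  = scroll_y + view_row;   // view_row: 0..14, wraps mod 32
wire [9:0] map_addr = {map_row, map_col};    // row * 32 + column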

The status memory is a simplified version of the map memory. Because the status line does not have to be read back by the processor, it is implemented as a single pseudo-dual-port memory: read from the display generator side and written from the processor side. The other two memories, holding the tile pixel data and the palette, are simple ROMs and are not changed at runtime.
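
Going back to the status memory, a sketch of that simpler arrangement (depth and names assumed) needs just a single inferred RAM, since each side only requires one port:

// Status line memory sketch: one write port (processor side) and one read
// port (display side). Names and depth are assumptions.
reg [7:0] status_ram [0:31];
reg [7:0] status_tile;

always @(posedge clk) begin
    if (status_write_en)
        status_ram[cpu_status_addr] <= cpu_write_data;   // processor write
    status_tile <= status_ram[disp_status_addr];         // display read
end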

In use this all works very well, though I’m at the limit of the block RAM resources of the iCE40UP5K.

A further complication was that I wanted my game to support multiple map levels, and there was insufficient block RAM available in the FPGA to support this. So I leveraged the I2C EEPROM on my iCE40UP5 development board, an 8 KByte AT24C64 (PDF). This required writing another program to populate the EEPROM with level data, and of course writing the assembly code to read the levels back out of the EEPROM from within the game.

Writing the actual game was obviously the most fun part of the project!

The mechanics are only loosely based on the NES (or any other) version (YouTube), which I can’t actually remember ever playing.

The following game mechanics are implemented:

  1. The player can tunnel through “dirt” turning it into empty space. He cannot tunnel through walls, which is what prevents the player from moving off the edge of the 32×32 tile map.
  2. Boulders (and gems) will fall downwards if the tile below them is empty (or contains the player!).
  3. Players can push boulders sideways, if the tile on the far side of the boulder is empty.
  4. Boulders will “roll” off another boulder below them, pseudo-randomly choosing to slide left or right.
  5. Boulders can squash the player if he tunnels under a boulder and moves in a way that would let a boulder fall onto him.
  6. Players need to collect gems by moving into them.
  7. After a set number (level dependent) of gems have been collected the exit to the next level will open up.
  8. Bats have been half-implemented: they will move around the map but cannot yet kill the player.
  9. Bats and gems are animated and will cycle through 4 frames of animation.

All in all it’s surprisingly playable, as I tried to demonstrate in a video I posted on YouTube:

Because I fancied another challenge, and because it might make it more likely that other people will be interested in utilising this work, my softcore, the VGA display and the Boulderdash game have been ported to the Terasic DE2-115 Cyclone IV-based development board. This was achieved by structuring the components in as modular a fashion as possible and then instantiating the modules in a “board level” top level module. To port the entire design to a different FPGA target board, only the board level module has to be re-implemented; currently a top level module is only about 250 lines. Obviously there are unavoidable hardware restrictions for some targets: the DE2-115 has a smaller I2C EEPROM, and unfortunately lacks a buzzer.

With boards based on both Lattice and Altera parts now supported as targets, there’s not much to stop these designs being ported to any FPGA dev board, assuming adequate FPGA and other hardware resources.

Having acquired a liking for working on softcores, I’ve decided that they will be my focus for a while. But I want to mix things up, so before working on improving the processor core I’ve decided I need another new development board to run them on: one that will let me play with some modern hardware I’ve resisted looking at since I started the projects documented here all those years ago.

 
