A 16 bit softcore processor: Implementation

In my previous blog post I talked about some of my goals for a simple 16 bit processor implemented in VHDL and running on an FPGA.

Here is a diagram showing the top level design, as generated by the Quartus tools:

For clarity the Control Unit and associated muxes are not shown; the remaining parts, plus the muxes, constitute the datapath.

Including the Control Unit there are five major elements, plus the outer block which encapsulates everything:

  1. Register File: holds the eight 16 bit general purpose registers and provides change operations (write, clear, increment and decrement) on them, as well as read access to two of them at a time (which feed the ALU). Registers are orthogonal in the sense that the programmer can use them either for holding addresses (pointers) or data.
  2. Program Counter: a conventional PC with the ability to branch, using an offset, or jump, using the address directly.
  3. ALU: has a 5 bit operation selector providing the ability to add, subtract, and, or, shift etc. 22 of 32 operation slots are currently used.
  4. Control Unit: the heart of the processor, this entity generates the control signals for the other parts through a state machine driven by the opcode being executed at that time.
  5. Bus Interface: is responsible for marshaling data onto and off of the external (memory and peripheral) buses. It deals with an 8 bit write being presented to the correct half of the external 16 bit bus, and the generation of a bus error signal when an unaligned 16 bit transfer is attempted.

There are two other minor players:

  1. Instruction Register: holds the current instruction, including outputs for the instruction broken out into its major fields including the opcode, registers selected by the programmer etc. See the opcode map for more on the instruction formats used.
  2. Temporary Register: usually holds the immediate value read out of a trailing instruction word, but can be used for other purposes in the future.

The datapath joins everything together. Muxes are used to route busses into an input port from different sources. For instance registers are either loaded from memory or from the result of an ALU operation. The input selected for each mux is set by the Control Unit.
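
As a rough illustration of one such mux, here is a minimal sketch of the register input selection described above. The entity name and the SELECT_ALU, ALU_RESULT, MEMORY_DATA and REGS_INPUT names are made up for this example and are not taken from the project's source; the real design may well use a wider select signal with more than two sources.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-input mux feeding the Register File's INPUT port
entity regs_input_mux_sketch is
    port (
        SELECT_ALU : in std_logic;  -- assumed select line driven by the Control Unit
        ALU_RESULT : in std_logic_vector (15 downto 0);
        MEMORY_DATA : in std_logic_vector (15 downto 0);
        REGS_INPUT : out std_logic_vector (15 downto 0)
    );
end entity;

architecture behavioral of regs_input_mux_sketch is
begin
    -- Purely combinational: the selected source appears on the output at once
    REGS_INPUT <= ALU_RESULT when (SELECT_ALU = '1') else MEMORY_DATA;
end architecture;
```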

Register File

First up, the Register File entity. This contains the eight 16 bit registers:

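entity registers is
    port (
        CLOCK : in STD_LOGIC;
        RESET : in STD_LOGIC;
        CLEAR : in STD_LOGIC;
        WRITE : in STD_LOGIC;
        INC : in STD_LOGIC;
        DEC : in STD_LOGIC;
        -- The read and inc/dec index ports are used by the architecture below;
        -- their type is assumed to match WRITE_INDEX
        READ_LEFT_INDEX : in T_REG_INDEX;
        READ_RIGHT_INDEX : in T_REG_INDEX;
        WRITE_INDEX : in T_REG_INDEX;
        INCDEC_INDEX : in T_REG_INDEX;
        LEFT_OUTPUT : out T_REG;
        RIGHT_OUTPUT : out T_REG;
        INPUT : in T_REG
    );
end entity;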

LEFT_OUTPUT and RIGHT_OUTPUT are the core busses within the processor and are used to hook the register file into the Arithmetic and Logic Unit (ALU). Selection of a register comes from the instruction being executed at that moment, which will be described later.

The architecture for this part is fairly trivial:

architecture behavioral of registers is
    signal REGISTERS : T_REGS := (others => DEFAULT_REG);
begin
    process (RESET, CLOCK)
    begin
        if (RESET = '1') then
            REGISTERS <= (others => DEFAULT_REG);
        elsif (CLOCK'Event and CLOCK = '1') then
            if (CLEAR = '1') then
                REGISTERS (to_integer(unsigned(WRITE_INDEX))) <= DEFAULT_REG;
            elsif (WRITE = '1') then
                REGISTERS (to_integer(unsigned(WRITE_INDEX))) <= INPUT;
            end if;
            if (INC = '1') then
                REGISTERS (to_integer(unsigned(INCDEC_INDEX))) <=
                REGISTERS (to_integer(unsigned(INCDEC_INDEX))) + 2;
            elsif (DEC = '1') then
                REGISTERS (to_integer(unsigned(INCDEC_INDEX))) <=
                REGISTERS (to_integer(unsigned(INCDEC_INDEX))) - 2;
            end if;
        end if;
    end process;

    LEFT_OUTPUT <= REGISTERS (to_integer(unsigned(READ_LEFT_INDEX)));
    RIGHT_OUTPUT <= REGISTERS (to_integer(unsigned(READ_RIGHT_INDEX)));
end architecture;

Changes to a register are clocked, whilst the two selected registers are continually read back. It is possible to clear a register via a dedicated signal.

A register can also be incremented or decremented by two. This is used by the stacking operations. Note that it is possible to set a new value in a register and increment or decrement a register in a single clock cycle.

The asynchronous RESET signal is used to clear all registers.

The testbench for this is not particularly extensive. A sample from it is as follows:

REGS_INPUT <= x"1234";
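REGS_WRITE <= '1';
-- (wait for a rising clock edge; omitted from this excerpt)
REGS_WRITE <= '0';
-- (wait for the outputs to update; omitted from this excerpt)
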
assert REGS_LEFT_OUTPUT = x"1234" and REGS_RIGHT_OUTPUT = x"1234"
    report "Read/Write of reg 0 failed" severity failure;

The registers are cleared via RESET, two registers are read and asserted on zero, then a value is set on a register before it is read back and asserted on the new value.

Program Counter

The Program Counter is another register and can be similarly written to, for the jump operation. It also has an increment signal, and can be signaled to branch by adding an offset to the current value:

entity programcounter is
    port (
        CLOCK : in STD_LOGIC;
        RESET : in STD_LOGIC;
        JUMP : in STD_LOGIC;
        BRANCH : in STD_LOGIC;
        INCREMENT : in STD_LOGIC;  -- used by the architecture below
        INPUT : in T_REG;
        OUTPUT : out T_REG
    );
end entity;

The implementation of this entity is as follows:

architecture behavioral of programcounter is
    signal PC : T_REG := DEFAULT_PC;
begin
    process (RESET, CLOCK)
    begin
        if (RESET = '1') then
            PC <= DEFAULT_PC;
        elsif (CLOCK'Event and CLOCK = '1') then
            if (JUMP = '1') then
                PC <= INPUT;
            elsif (BRANCH = '1') then
                PC <= PC + INPUT;
            elsif (INCREMENT = '1') then
                PC <= PC + 2;
            end if;
        end if;
    end process;

    OUTPUT <= PC;
end architecture;

And the crux of the test bench:

RESET <= '1';
wait for 1 ns;
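RESET <= '0';
-- (wait for the reset to take effect; omitted from this excerpt)

assert PC_OUTPUT = x"0000"
    report "PC reset" severity failure;

-- (pulse the increment signal for one clock cycle; omitted from this excerpt)

assert PC_OUTPUT = x"0002"
    report "PC increment" severity failure;

PC_JUMP <= '1';
PC_INPUT <= x"1234";
-- (wait for a rising clock edge; omitted from this excerpt)
PC_JUMP <= '0';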

assert PC_OUTPUT = x"1234"
    report "PC jump" severity failure;

This tests the reset, increment, and jump functions.

The Instruction and Temporary registers are similar to the Program Counter and are not worth going into in detail here.

ALU

The ALU is perhaps more interesting. First the interface:

entity alu is
    port (
        CLOCK : in STD_LOGIC;
        DO_OP : in STD_LOGIC;
        OP : in T_ALU_OP;
        LEFT, RIGHT : in STD_LOGIC_VECTOR (15 downto 0);
        CARRY_IN : in STD_LOGIC;
        RESULT : out STD_LOGIC_VECTOR (15 downto 0);
        CARRY_OUT : out STD_LOGIC;
        ZERO_OUT : out STD_LOGIC;
        NEG_OUT : out STD_LOGIC;
        OVER_OUT : out STD_LOGIC
    );
end entity;

And the (simplified) implementation:

architecture behavioral of alu is
begin
    process (CLOCK)
        variable TEMP_LEFT : STD_LOGIC_VECTOR (16 downto 0) := (others => '0');
        variable TEMP_RIGHT : STD_LOGIC_VECTOR (16 downto 0) := (others => '0');
        variable TEMP_RESULT : STD_LOGIC_VECTOR (16 downto 0) := (others => '0');
        variable GIVE_RESULT : STD_LOGIC := '0';
    begin
        if (CLOCK'Event and CLOCK = '1') then
            if (DO_OP = '1') then
                GIVE_RESULT := '1';
                -- Widen the inputs to 17 bits so that bit 16 captures the carry
                TEMP_LEFT := '0' & LEFT (15 downto 0);
                TEMP_RIGHT := '0' & RIGHT (15 downto 0);
                case OP is
                    when OP_ADD =>
                        TEMP_RESULT := TEMP_RIGHT + TEMP_LEFT;
                    when OP_ADDC =>
                        TEMP_RESULT := TEMP_RIGHT + TEMP_LEFT + CARRY_IN;
                    when OP_SUB =>
                        TEMP_RESULT := TEMP_RIGHT - TEMP_LEFT;
                    --- snip ---
                    when OP_TEST =>
                        TEMP_RESULT := TEMP_RIGHT;
                        GIVE_RESULT := '0';
                    when others =>
                        TEMP_RESULT := (others => '0');
                end case;
                if (GIVE_RESULT = '1') then
                    RESULT <= TEMP_RESULT (15 downto 0);
                else
                    -- Result discarded: pass the right input through unchanged
                    RESULT <= RIGHT;
                end if;
                CARRY_OUT <= TEMP_RESULT (16);
                if (TEMP_RESULT (15 downto 0) = x"0000") then
                    ZERO_OUT <= '1';
                else
                    ZERO_OUT <= '0';
                end if;
                NEG_OUT <= TEMP_RESULT (15);
                if (OP = OP_ADD or OP = OP_ADDC) then
                    -- Signed overflow: the result's sign differs from the sign of both inputs
                    if (TEMP_LEFT (15) /= TEMP_RESULT (15) and TEMP_RIGHT (15) /= TEMP_RESULT (15)) then
                        OVER_OUT <= '1';
                    else
                        OVER_OUT <= '0';
                    end if;
                end if;
                -- snip --
            end if;
        end if;
    end process;
end architecture;

This is relatively simple. The carry flag is a 17th bit at the front of the result. It will be ‘1’ if the result does not fit in the 16 bit result field. The negative flag is the 16th bit (ie. the sign bit), and the zero flag is set simply if the result is zero. The compare, test against zero, and bit test operations are different in that they discard the computation result. An overflow flag is also generated on arithmetic operations to indicate that the result of a signed computation is invalid because it has the wrong sign. For instance adding 0x7fff as a 16 bit two’s complement value, 32767 in decimal, to 0x0001 results in 0x8000 which is -32768 in decimal and clearly wrong.
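
To make the flag rules concrete, the following is a small self-checking sketch that works through the 0x7fff + 0x0001 example. It is not the project's ALU: the entity name, the constants and the variables are invented for this illustration, and it uses numeric_std arithmetic rather than whatever packages the real code relies on.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_flags_sketch is
end entity;

architecture sim of add_flags_sketch is
begin
    process
        constant A : unsigned (15 downto 0) := x"7fff";  -- 32767
        constant B : unsigned (15 downto 0) := x"0001";
        variable SUM : unsigned (16 downto 0);
        variable CARRY, ZERO, NEG, OVER : std_logic;
    begin
        -- Add at 17 bits so that the carry lands in bit 16
        SUM := resize(A, 17) + resize(B, 17);
        CARRY := SUM (16);
        NEG := SUM (15);
        if (SUM (15 downto 0) = 0) then
            ZERO := '1';
        else
            ZERO := '0';
        end if;
        -- Signed overflow: inputs share a sign and the result's sign differs
        if (A (15) = B (15) and SUM (15) /= A (15)) then
            OVER := '1';
        else
            OVER := '0';
        end if;

        assert SUM (15 downto 0) = x"8000"
            report "unexpected sum" severity failure;
        assert CARRY = '0' and ZERO = '0' and NEG = '1' and OVER = '1'
            report "unexpected flags" severity failure;
        report "add flags sketch passed";
        wait;
    end process;
end architecture;
```

The sum fits in 16 bits, so carry stays clear, but the sign bit flips from the inputs' shared sign, which is exactly the case the overflow flag exists to report.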

The test bench runs through each operation, supplying inputs and checking the result and flags. For example, here are the ADD tests:

run_test(OP_ADD, x"0001", x"0002" ,'0', x"0003", '0', '0', '0', '0');
run_test(OP_ADDC, x"0001", x"0002" ,'0', x"0003", '0', '0', '0', '0');
run_test(OP_ADDC, x"0001", x"0002" ,'1', x"0004", '0', '0', '0', '0');
run_test(OP_ADD, x"ffff", x"0001" ,'0', x"0000", '1', '1', '0', '0');
run_test(OP_ADD, x"4000", x"4000" ,'0', x"8000", '0', '0', '1', '1');
run_test(OP_ADDC, x"ffff", x"0000" ,'1', x"0000", '1', '1', '0', '0');
run_test(OP_ADDC, x"8000", x"7fff" ,'0', x"ffff", '0', '0', '1', '0');
run_test(OP_ADDC, x"8000", x"7fff" ,'0', x"ffff", '0', '0', '1', '0');
run_test(OP_ADDC, x"7ffe", x"0001" ,'1', x"8000", '0', '0', '1', '1');

The arguments to run_test() are:

  1. Operation code
  2. Left and right inputs
  3. Carry in flag
  4. Expected result
  5. Expected carry out flag
  6. Expected zero flag
  7. Expected negative flag
  8. Expected overflow flag

Control Unit

The Control Unit is the core of the processor. It is too large to describe in detail here. Anyone interested should look at the code on GitHub.

Its outputs are a mixture of control signals to another entity, like PC_INC to increment the Program Counter, and select signals to various muxes which control which busses go where. For instance a register can either be loaded with a value from memory or it can be loaded with the result of an ALU instruction.

It is in essence a state machine which runs through the classical four activities of a processor:

  1. Fetch
  2. Decode
  3. Execute
  4. Write-back

Instructions are split roughly 50/50 in needing either three or four clock ticks. The current opcodes defined are:


NOP: does nothing, predictably.
HALT: stops the processor until it is reset. This operates simply by keeping the Control Unit in the S_HALT state, and requires even fewer VHDL statements than the implementation of NOP.

Load and store

LOADI: loads the following immediate quantity into one of the 8 general purpose registers.
LOADR: loads a register with the value held in memory at the address held in another register.
LOADRD: as above but with an immediate 16 bit displacement. As per branches the displacement may wrap. The final address is calculated using the ALU.
LOADM: load a register with a value held in memory using the address found at the following word.
STORER: saves a register into the address held in another register.
STORERD: as above but with an immediate displacement.
STOREM: saves a register into the address found at the following word.

Loads can either be on whole words or on bytes that are sign or zero extended to full words. Internally all operations operate on words, so it is unlike the 68000 which leaves the upper portions of a register alone when doing narrowed arithmetic (or other) operations.

Stores do not need to be extended, but it is possible to select either byte or word operations. Bytes are written to the upper or lower word half, as dictated by the low order bit of the address. This particular logic is contained in the Bus Interface; the Control Unit merely passes along the “cycle type” obtained from the instruction word.

Arithmetic and Logic

ALUM and ALUMI: perform an ALU operation with a destination and an operand. With ALUM the operand comes from a register; with ALUMI it comes from an immediate 16 bit quantity.
ALUS: performs an ALU operation which does not require an operand, like Increment or Bitwise NOT.

The ALU operations, the logic for which is in the ALU entity, are as follows. First the operations using an operand:

  1. Add
  2. Add with Carry
  3. Subtract
  4. Subtract with Carry
  5. Bitwise AND
  6. Bitwise OR
  7. Bitwise XOR
  8. Copy register
  9. Bit-test
  10. Compare

Bit-test and compare do not modify the result but do set the status flags.

The copy operation duplicates a register’s value (the “operand”) into the destination, and is done in the ALU simply because it was the easiest way to implement it.

The operations that operate on the destination directly:

  1. Increment
  2. Increment by two
  3. Decrement
  4. Decrement by two
  5. Bitwise NOT
  6. Logical shift left
  7. Logical shift right
  8. Arithmetic shift left
  9. Arithmetic shift right
  10. Negate (invert the sign)
  11. Byte swap
  12. Test against zero

Operations set the status bits carry, zero and negative, as expected. The overflow bit is set after the arithmetic operations, as mentioned above.

Note that it is only after these instructions that the flag bits (carry etc) are changed; a load operation does not change the flags, unlike typical CISC processors.

Control Flow

JUMP: There are eight bits which set the conditions for a jump. If the conditions are not met, the jump does not occur and instead the next instruction is executed.
BRANCH: Same as above, but the immediate amount is added to the Program Counter instead of replacing it.

The condition flags are as follows:

4 bits of condition “care” flags: carry, zero, negative and overflow
4 bits of the required condition “polarity” flags, assuming the cares bit is set

For example, if the cares bits are “0110” and the polarity bits are “0010” then the zero flag must be clear and the negative flag must be set. It does not matter what state the other flags are in.

To perform an unconditional jump or branch, the cares bits should be “0000”.
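
The care and polarity scheme can be sketched as a single masked comparison. This is only an illustration: the condition_met function and the entity around it are hypothetical, and the bit order (carry, zero, negative, overflow from bit 3 down to bit 0) is an assumption taken from the order in which the flags are listed above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity condition_sketch is
end entity;

architecture sim of condition_sketch is
    -- Assumed bit order: carry, zero, negative, overflow (3 downto 0)
    function condition_met (
        CARES, POLARITY, FLAGS : std_logic_vector (3 downto 0)
    ) return boolean is
    begin
        -- Only flags whose cares bit is set take part in the comparison
        return (FLAGS and CARES) = (POLARITY and CARES);
    end function;
begin
    process
    begin
        -- Zero clear and negative set: condition met
        assert condition_met("0110", "0010", "0010")
            report "expected match" severity failure;
        -- Carry and overflow are ignored, so this still matches
        assert condition_met("0110", "0010", "1011")
            report "expected match with don't-care flags" severity failure;
        -- Zero set: condition not met
        assert not condition_met("0110", "0010", "0110")
            report "expected no match" severity failure;
        -- No cares bits set: always taken, whatever the flags are
        assert condition_met("0000", "0000", "1111")
            report "expected unconditional match" severity failure;
        report "condition sketch passed";
        wait;
    end process;
end architecture;
```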

Using a compare ALU operation and a specially coded jump or branch, it is possible to do any comparison test. For example, the instruction:

  • compare r0,r1

Followed by a jump or branch using the above flags would perform the control flow operation only if the r1 register had a lower value than r0.

Stack Operations

PUSHQUICK: The source register is written into the memory pointed to by the destination register and that register is decremented.
POPQUICK: The reverse of the above.
CALLJUMP: The stack pointer is decremented, the current program counter is pushed onto the stack, and the immediate value following the CALLJUMP becomes the new Program Counter.
CALLBRANCH: The same as the above except the subroutine is branched to, ie the Program Counter has the immediate value added to it.
RETURN: The Program Counter is pulled off the stack and the stack pointer is incremented.

The reason the push and pop operations have QUICK in their name is that at one point there were MULTI push and pop stacking operations which operated on multiple registers at a time, in a similar way to MOVEM on the 68000.

The CALL and RETURN operations can operate on any arbitrary general purpose register, and use that as the stack pointer. But because this does not have very much utility, and because it would make for verbose call and return instructions, the assembler currently only uses r7 for this purpose.

Bus Interface

The Bus Interface sits between the core of the processor and the outside world (ie. memory and peripherals) and translates accesses between the core processor’s view of the address space, and the rest of the system.

The core processor sees the address space as 65536 addresses, each of which can be read as a byte or a word. Words at adjacent addresses overlap, so to read the program stream, which is a sequence of words, the Program Counter must be incremented by two. The important point is that a byte read at any address is always right aligned into the word, ie. in the least significant half.

The external view of memory is as you’d expect and the same as the 68000: there is no external A0 pin, and instead byte wide accesses are controlled via an upper and lower strobe pin.

The external connections on the Bus Interface are as follows:

entity businterface is
    port (
        CLOCK : in STD_LOGIC;
        RESET : in STD_LOGIC;

        CPU_ADDRESS : in STD_LOGIC_VECTOR (15 downto 0);
        CPU_BUS_ACTIVE : in STD_LOGIC;
        CPU_CYCLETYPE_BYTE : in STD_LOGIC;
        CPU_DATA_OUT : in STD_LOGIC_VECTOR (15 downto 0);
        CPU_DATA_IN : out STD_LOGIC_VECTOR (15 downto 0);
        CPU_READ : in STD_LOGIC;
        CPU_WRITE : in STD_LOGIC;

        BUSINTERFACE_ADDRESS : out STD_LOGIC_VECTOR (14 downto 0);
        BUSINTERFACE_UPPER_DATA : out STD_LOGIC;
        BUSINTERFACE_LOWER_DATA : out STD_LOGIC;
        BUSINTERFACE_ERROR : out STD_LOGIC;
        BUSINTERFACE_DATA_IN : in STD_LOGIC_VECTOR (15 downto 0);
        BUSINTERFACE_DATA_OUT : out STD_LOGIC_VECTOR (15 downto 0)
    );
end entity;

The CPU_ connections are for the inner CPU, the BUSINTERFACE_ connections are for the outside world.

CPU_BUS_ACTIVE indicates whether the bus is currently in use, and CPU_CYCLETYPE_BYTE indicates a byte transfer if it is ‘1’ or a word transfer if it is ‘0’. The other CPU_ connections are self-explanatory.

BUSINTERFACE_ERROR indicates an unaligned word access, ie. a word request that would need two accesses (on opposite halves of the databus) to satisfy. The Bus Interface does not currently (and probably won’t ever) support this, and instead it asserts BUSINTERFACE_ERROR for that memory cycle.

The implementation is quite interesting:

architecture behavioral of businterface is
begin
    process (CLOCK)
    begin
        if (CLOCK'Event and CLOCK = '1') then
            BUSINTERFACE_UPPER_DATA <= '0';
            BUSINTERFACE_LOWER_DATA <= '0';
            BUSINTERFACE_ERROR <= '0';

            -- Shift the address to being a word address, moving low bit to upper/lower indicators
            BUSINTERFACE_ADDRESS <= CPU_ADDRESS (15 downto 1);
            if (CPU_CYCLETYPE_BYTE = '0' or CPU_ADDRESS (0) = '0') then
                BUSINTERFACE_UPPER_DATA <= '1';
            end if;
            if (CPU_CYCLETYPE_BYTE = '0' or CPU_ADDRESS (0) = '1') then
                BUSINTERFACE_LOWER_DATA <= '1';
            end if;
            if (CPU_CYCLETYPE_BYTE = '0' and CPU_ADDRESS (0) = '1' and CPU_BUS_ACTIVE = '1') then
                BUSINTERFACE_ERROR <= '1';
            end if;

            if (CPU_CYCLETYPE_BYTE = '0' and CPU_ADDRESS (0) = '0') then
                -- Aligned word: both directions are straight copies
                BUSINTERFACE_DATA_OUT <= CPU_DATA_OUT;
                CPU_DATA_IN <= BUSINTERFACE_DATA_IN;
            elsif (CPU_ADDRESS (0) = '0') then
                -- Byte at an even address: upper half of the external bus
                BUSINTERFACE_DATA_OUT <= CPU_DATA_OUT (7 downto 0) & x"ff";
                CPU_DATA_IN <= x"ff" & BUSINTERFACE_DATA_IN (15 downto 8);
            elsif (CPU_ADDRESS (0) = '1') then
                -- Odd address: lower half of the external bus
                BUSINTERFACE_DATA_OUT <= x"ff" & CPU_DATA_OUT (7 downto 0);
                CPU_DATA_IN <= x"ff" & BUSINTERFACE_DATA_IN (7 downto 0);
            else
                -- Fallback for any other input combination
                BUSINTERFACE_DATA_OUT <= x"ffff";
                CPU_DATA_IN <= x"ffff";
            end if;

        end if;
    end process;
end architecture;

A bus error is flagged when a word wide access (CPU_CYCLETYPE_BYTE is ‘0’) is made to an odd address (the low bit of CPU_ADDRESS is ‘1’) while the bus is active.

Formatting the BUSINTERFACE_DATA_OUT for word accesses is trivial. For bytes it is done by putting the low half of CPU_DATA_OUT on the correct half. It is necessary to use the low half of the CPU_DATA_OUT because, as stated above, the core processor’s view of data is always that it is right aligned. This covers the processor performing a write operation.

For word wide reads, the CPU_DATA_IN can be directly copied from BUSINTERFACE_DATA_IN. For byte wide reads, the upper half is always a dummy value (x"ff"), with the lower half coming from the relevant half of the BUSINTERFACE_DATA_IN word.
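
As a worked example of the byte read formatting, here is a small self-checking sketch. The read_byte function, its parameter names and the surrounding entity are hypothetical and exist only to show the lane selection in isolation; the even-address-is-upper-half rule is the one used by the architecture above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity byte_read_sketch is
end entity;

architecture sim of byte_read_sketch is
    -- Hypothetical helper: pick the byte selected by the low address bit and
    -- right align it, padding the upper half with the dummy value x"ff"
    function read_byte (
        BUS_WORD : std_logic_vector (15 downto 0);
        ADDRESS_LOW_BIT : std_logic
    ) return std_logic_vector is
    begin
        if (ADDRESS_LOW_BIT = '0') then
            return x"ff" & BUS_WORD (15 downto 8);  -- even address: upper half
        else
            return x"ff" & BUS_WORD (7 downto 0);   -- odd address: lower half
        end if;
    end function;
begin
    process
    begin
        assert read_byte(x"12ab", '0') = x"ff12"
            report "even byte read" severity failure;
        assert read_byte(x"12ab", '1') = x"ffab"
            report "odd byte read" severity failure;
        report "byte read sketch passed";
        wait;
    end process;
end architecture;
```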

Data Path (External Entity)

The external entity’s connections map quite well to the pins on a conventional microprocessor, such as the 68000. The exception is the databus: instead of being bidirectional, there are in and out busses.

In terms of clocks, the processor has an external clock (CLOCK) and a generated clock (CLOCK_MAIN). The external clock is used by the Bus Interface and external synchronous memories, ie. on chip FPGA memory arrays. CLOCK_MAIN is 4 times slower and clocks the rest of the system including the Control Unit.
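
How CLOCK_MAIN is derived is not shown in this post. One simple way to get a clock four times slower, assuming a free-running two bit counter is acceptable, is sketched below; the entity name and the COUNTER signal are hypothetical and this is not necessarily how the project does it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity clock_divider_sketch is
    port (
        CLOCK : in std_logic;
        RESET : in std_logic;
        CLOCK_MAIN : out std_logic
    );
end entity;

architecture behavioral of clock_divider_sketch is
    signal COUNTER : unsigned (1 downto 0) := "00";
begin
    process (RESET, CLOCK)
    begin
        if (RESET = '1') then
            COUNTER <= "00";
        elsif rising_edge(CLOCK) then
            -- Wraps from "11" back to "00" every four CLOCK cycles
            COUNTER <= COUNTER + 1;
        end if;
    end process;

    -- The top counter bit toggles every two CLOCK cycles, so its period is four
    CLOCK_MAIN <= COUNTER (1);
end architecture;
```

On real hardware a PLL or a clock enable is often preferred over a counter-derived clock, so treat this purely as an illustration of the 4:1 ratio.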

HALTED is asserted when the processor runs the HALT instruction, and is intended to drive an LED or similar.

The interface to the processor is defined by the following entity:

entity cpu is
    port (
        CLOCK : in STD_LOGIC;
        CLOCK_MAIN : out STD_LOGIC;
        RESET : in STD_LOGIC;
        ADDRESS : out STD_LOGIC_VECTOR (14 downto 0);
        UPPER_DATA : out STD_LOGIC;
        LOWER_DATA : out STD_LOGIC;
        DATA_IN : in STD_LOGIC_VECTOR (15 downto 0);
        DATA_OUT : out STD_LOGIC_VECTOR (15 downto 0);
        BUS_ERROR : out STD_LOGIC;
        READ : out STD_LOGIC;
        WRITE : out STD_LOGIC;
        HALTED : out STD_LOGIC
    );
end entity;

The architecture for the Data Path mostly consists of instantiations of the various units of the design eg, the Register File, ALU, etc. One thing it also contains is the sign and zero extension logic:

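CPU_DATA_IN_EXTENDED <=
    (8 to 15 => CPU_DATA_IN (7)) & CPU_DATA_IN (7 downto 0) when  -- snip: signed byte read
    (8 to 15 => '0') & CPU_DATA_IN (7 downto 0) when              -- snip: unsigned byte read
    CPU_DATA_IN;                                                  -- otherwise: word read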

For sign extension, bits 8 to 15 of CPU_DATA_IN_EXTENDED are copied from bit 7 (the sign bit of the byte) of CPU_DATA_IN, with the remaining eight bits being a direct copy. For unsigned extensions the upper bits are set to zero. Word reads are of course simple copies.
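
The two kinds of extension can also be checked with concrete values. The sketch below uses the numeric_std resize function instead of the aggregate expressions above, which is a different technique giving the same result; the extend_byte function and its SIGNED_READ parameter are invented for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity extend_sketch is
end entity;

architecture sim of extend_sketch is
    -- Hypothetical helper: widen a right aligned byte to a full word
    function extend_byte (
        DATA : std_logic_vector (7 downto 0);
        SIGNED_READ : boolean
    ) return std_logic_vector is
    begin
        if SIGNED_READ then
            -- resize on signed replicates bit 7 into bits 8 to 15
            return std_logic_vector(resize(signed(DATA), 16));
        else
            -- resize on unsigned fills bits 8 to 15 with zeros
            return std_logic_vector(resize(unsigned(DATA), 16));
        end if;
    end function;
begin
    process
    begin
        assert extend_byte(x"80", true) = x"ff80"
            report "sign extension of negative byte" severity failure;
        assert extend_byte(x"7f", true) = x"007f"
            report "sign extension of positive byte" severity failure;
        assert extend_byte(x"80", false) = x"0080"
            report "zero extension" severity failure;
        report "extension sketch passed";
        wait;
    end process;
end architecture;
```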

CPU_CYCLETYPE_BYTE <= '1' when (
    -- snip: the cycle type is a signed or unsigned byte transfer --
) else '0';
CPU_BUS_ACTIVE <= '1' when (CPU_READ = '1' or CPU_WRITE = '1') else '0';

Here, CPU_CYCLETYPE_BYTE, which is used by the Bus Interface, is high when a signed or unsigned byte transfer is performed, otherwise it is low. CPU_BUS_ACTIVE is high when a read or write operation is active. Otherwise, eg. when the Control Unit is running an ALU operation, it is low.

This concludes the discussion on the implementation of the processor proper. You can find more documentation, including a complete Opcode Map, on the project’s homepage on GitHub at https://github.com/aslak3/cpu.

I have also made a short video going over the design. It covers most of the things outlined in this post.

In terms of actually using the processor I’ve designed and built, I have included it in a wider design incorporating the following parts:

  • Processor
  • VGA display
  • PS/2 port with attached keyboard
  • LEDs, seven segment display, buttons

The VGA display and the PS/2 interface both borrow code from my MAXI000 project. In the case of the VGA display however I have extended it somewhat from the text and bitmap modes I’ve previously written about, and have implemented a tile mode.

The tile mode uses 16 by 16 tiles, where each pixel is one of 16 colours. The purpose of this mode is to allow me to implement a Snake game:

This video also shows a further upgrade to my FPGA setup: a new development board. Though my processor works fine on the £35 board I bought from eBay, I thought it would be interesting to run it on a bigger board: the Terasic DE2-115. This is a terrific, and rather expensive, board with a large FPGA and many, many peripheral ICs and connectors:

  • Altera Cyclone IV 4CE115 FPGA device
  • USB Blaster (on board) for programming; both JTAG and Active Serial (AS) programming modes are supported
  • 2MB SRAM
  • Two 64MB SDRAM
  • 8MB Flash memory
  • SD Card socket
  • 4 Push-buttons, 18 Slide switches
  • 18 Red user LEDs, 9 Green user LEDs
  • 50MHz oscillator for clock sources
  • 24-bit CD-quality audio codec IC
  • VGA DAC (8-bit high-speed triple DACs) with VGA-out connector
  • 2 Gigabit Ethernet PHY with RJ45 connectors
  • USB Host/Slave Controller with USB type A and type B connectors
  • RS-232 transceiver and 9-pin connector
  • PS/2 port
  • Expansion ports

In short, it has pretty much everything I could ever need. About the only thing it doesn’t have is a buzzer. In terms of the FPGA, it has (as the name of the board implies) 115K Logic Elements. For a point of reference, a 68000 design requires around 6K LEs, and all the hardware required to implement my Snake game requires around 1.5K LEs.

The next post will cover the software side of my processor; what assembler I’m using and how the Snake game was implemented.
