You should read Part 1 and Part 2 of this series if you haven’t already.
A problem with the HUB75 display reared its head almost as soon as I started playing about with MQTT: occasionally the screen would flicker, only slightly, but just enough to be noticed. At this stage I was driving the HUB75 pins with my own RP2040 code using a dedicated core. This required building my project with the (then) experimental SMP branch of the FreeRTOS project.
My working theory, which I have yet to fully confirm, is that there must be a lock in the networking stack somewhere. So, whilst the TCP/IP (or MQTT) code is running the other core is blocked from progressing and essentially halted. My method for investigating this phenomenon was crude: basically saturating the RP2040 with MQTT requests. When I do this the display is very flickery.
As previously mentioned, there are three tasks involved. The animation task renders into a framebuffer, which is subsequently sent to the matrix task, running on the other core, for display. A third task is responsible for all the network management and runs on the same core as the animation task. The FreeRTOS APIs used are xQueueOverwrite() in the animation (sender) task and xQueuePeek() in the matrix (receiver) task. This is not a usual FreeRTOS message queue, since only one item (the framebuffer) can ever be pending. While waiting for the next framebuffer the receiving task will happily continue displaying the current one. That is to say, it does not block and simply polls for a new framebuffer.
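The single-item "mailbox" behaviour described above can be modelled in a few lines of Python. This is only a sketch of the pattern, not the FreeRTOS code used in the project: the `Mailbox` class and the `next_frame()` helper are hypothetical names invented for the illustration.

```python
import threading


class Mailbox:
    """Single-slot mailbox: the sender overwrites, the receiver peeks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._item = None  # None means nothing has been posted yet

    def overwrite(self, item):
        # Like an overwrite on a length-1 queue: never blocks, the old
        # pending item is simply replaced.
        with self._lock:
            self._item = item

    def peek(self):
        # Like a peek with a zero timeout: returns the pending item
        # without removing it, or None if nothing is there.
        with self._lock:
            return self._item


def next_frame(mailbox, current):
    """Receiver side: keep showing `current` unless a frame is pending."""
    pending = mailbox.peek()
    return current if pending is None else pending


mailbox = Mailbox()
mailbox.overwrite("frame 1")
mailbox.overwrite("frame 2")  # frame 1 is silently dropped
print(next_frame(mailbox, "boot frame"))
```

The point of the pattern is that neither side ever waits on the other: a slow receiver only ever sees the newest frame, and a slow sender just means the same frame is displayed for longer.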
Note to self: a structural diagram showing the cores, tasks and messages would be very useful to have.
My initial approach to curing this flickering problem was to switch from RP2040 code directly manipulating the HUB75 pins to using the RP2040’s PIO mode to control them. Fortunately, the Pico SDK Example repo comes with a very nice example project for exactly this. I was extremely pleased that a ready-to-run PIO module for driving a HUB75 matrix was available because the amount of learning I’d have to have done to implement this myself would have been considerable. This was fairly easy to adapt, but unfortunately the result was only an improvement, not a complete cure. At some point in the future I’d love to explore the capabilities of the RP2040’s PIO mode further.
This left me with two choices:
- Live with the very occasional flickering. In real use, without flooding the RP2040 with MQTT packets, it was (and is) very, very slight. Barely noticeable in fact.
- Add an FPGA and drive the HUB75 pins with that, feeding the framebuffer to the FPGA using some kind of serial protocol.
The flickering bugged me. Even if it was so subtle that no one else looking at the display could see it, I knew it flickered and that was enough. I had to fix it.
At the same time as I was working on this project I was also working on my iCE40UP Development Board. This was unusual for me, as usually I will finish one project before getting onto the next. But it was a happy overlap: using an iCE40UP for driving a HUB75 LED matrix was a nice match:
- Not many FPGA pins would be needed: 14 for the HUB75 connector, and maybe 5 back to the Pi Pico W board for data. And perhaps a few more for diagnostic LEDs and the clock.
- Critically, the amount of block RAM was just about enough. See below for the calculation.
- Also importantly, I had some, and SPI flashes for the configuration, spare.
The principal concept behind the HUB75 control logic is that the FPGA would hold two framebuffers: one for the currently displayed image and one for the next image, which the RP2040 was feeding it on some kind of serial connection. This is a classic double buffering technique whereby the currently in use image (or sound buffer, etc) is not being altered while the next image (or sound buffer, etc) is being generated or received. Once the next image (or sound buffer, etc) is available the two are swapped.
The key consideration as to whether a Lattice iCE40UP (PDF) is suitable in this application is the amount of available block RAM. Looking at the table at the top of page 9 of the datasheet it can be seen that the iCE40UP5K has 120Kbit of EBR memory, EBR being short for Embedded Block RAM. This memory can be configured in a way Lattice refers to as Pseudo Dual Port. It is not true Dual Port memory because the two ports are not equal; you cannot have either side be a reader and a writer at the same time. Fortunately this does not matter in the HUB75 application, as there need to be two distinct processes attached to the memory: one on the writing side, receiving the data from the RP2040, and one on the reading side, driving the HUB75 pins. Neither side needs to perform both operations on the memory.
Two frames of 64 pixels by 32 pixels using 32 bits per pixel (8 bits are unused per pixel to simplify the logic compared to going with a 24 bit per pixel frame) gives a total of 131,072 bits, which is just over the 120Kbit (122,880) available in the FPGA. The simple solution is to truncate the input data to 16 bits per pixel inside the FPGA. It is doubtful that the user will be able to notice any degradation in image quality; this is a 4mm pitch LED matrix after all.
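The memory budget is easy to check with a few lines of Python. This is just the arithmetic from the paragraph above; the function name is made up for the illustration.

```python
WIDTH, HEIGHT = 64, 32   # panel size in pixels
BUFFERS = 2              # double buffering: displayed frame + incoming frame
EBR_BITS = 120 * 1024    # iCE40UP5K embedded block RAM, 122,880 bits


def framebuffer_bits(bits_per_pixel):
    """Total bits needed to hold both framebuffers at a given depth."""
    return WIDTH * HEIGHT * bits_per_pixel * BUFFERS


for bpp in (32, 16):
    need = framebuffer_bits(bpp)
    verdict = "fits" if need <= EBR_BITS else "does not fit"
    print(f"{bpp} bits per pixel: {need} bits, {verdict} in {EBR_BITS}")
```

At 32 bits per pixel the two buffers need 131,072 bits, which is 8,192 bits more than is available. At 16 bits per pixel they need 65,536 bits, a little over half of the block RAM.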
I’ll now go over the hardware. The coding of the Verilog happened more or less in parallel with that of the hardware design.
The general idea was to design and build a “motherboard” which would hold the Pico W board, FPGA and a few other parts. That board would then go in a case, mounted behind the HUB75 panel. The principal features for this board are:
- 5V barrel jack to power the Pico W and the HUB75 panel. A pair of screw terminals would be used to wire in power to the HUB75 panel, replacing the bench PSU in use while I work on the project.
- Headers to receive the Pi Pico W board. This would need to go next to the board edge, so I could access the micro USB socket while the unit was all together so the Pi Pico W could be programmed in situ. Attached to the Pi Pico W pins:
- A DS3231 (PDF) I2C Real Time Clock (with integrated temperature sensor)
- CR1225 battery backup
- Buzzer
- The iCE40UP5K:
- N25Q032A (PDF) flash
- Diagnostic LED
- 50MHz oscillator can
Onto the schematic.
This is the Pico W section. As well as the two rows of 20 socket headers which receive the Pico W board pins, it has attached a DS3231 Real Time Clock IC and a buzzer. Because it might be useful, the pins for the RP2040’s UART are brought out onto a 4 way header, which includes Vcc. This is J2. J5 is a header for the I2C bus, which can be used to attach external I2C sensor boards and such like.
Because I decided I wanted options for how this board was going to be used, there are two sets of IDC16 HUB75 connectors across the whole board, one attached to the Pico W and one attached to the FPGA. In fact the FPGA can be omitted entirely; in this mode the aforementioned PIO mode is used to drive the HUB75 display. I did this because I wasn’t entirely sure I’d be able to get the FPGA design done and working. I also added options for how the HUB75 display board is powered. The two choices are screw terminals and a large berg-like connector which I do not know the name of. One of these connectors is present on the back of the display. These are J7 and J4.
The schematic for the rest of the board is very much a derivative of my iCE40UP development board project. The same headers and jumper arrangement are used for programming the FPGA and the SPI flash on this board.
I pondered making the SPI flash, and the FPGA, programmable directly from the Pico W on this board, which is ordinarily used to produce the image to display. In the end I opted not to pursue this option, but only because this board had enough “unknowns” already. This would have been a tidy solution for programming the FPGA, as I could have created a mode in my Matrix Display firmware which programmed the FPGA, possibly even with a bitstream embedded in the Pico W firmware, omitting the FPGA configuration flash from the board entirely. As it was I had to attach my SPI flashing Pico breadboard instead, whilst working on the Verilog code.
With that out of the way, the most interesting aspect is the mechanism by which the Pico W communicates with the FPGA to transfer the frame image. In the end I settled on a simple SPI-like connection. Four signals are used: CTRL_CLK, CTRL_MOSI, CTRL_MISO and ~CTRL_SS. There is also a ~CTRL_RESET, intended to force the FPGA to reset its state. This is useful if the RP2040 is reset without the FPGA being power cycled, as would happen when the RP2040 receives new firmware. The exact details of how the SPI connection is operated will be discussed later, when looking at the Verilog coding.
The next step was to work on the PCB design. Leveraging what I’d learned from designing my iCE40UP development board this was fairly straightforward.
Unlike the development board, the various iCE40UP power rails come in via copper pours. The internal power plane is used exclusively for 3.3V.
The mandatory 3D views:
And the back:
I was quite pleased with the design; it is compact and “tidy” but, full disclosure, this is the second iteration of the board. The first one had a number of issues, the main one being the problem with the RGB LED pins which my development board had, but unlike that board I wasn’t able to work around the issue with trace cuts. In this board’s case I had to spin a new board.
The (final) board design was duly ordered and soldered up:
The diligent will notice that this is v1.1, but the design is for v1.2. This is because the final design on GitHub includes a few further tweaks, beyond just fixing the problem with the RGB LED pins:
- The gap between the RTC battery and the Pico W socket was narrowed somewhat to make it easier to insert (and remove) the battery.
- The gap between the HUB75 power screw terminals and the berg-like connector was widened.
The board design is up on its own GitHub repository.
With the physical hardware out of the way, it’s now time to look at the Verilog coding.
A key requirement for the coding I set myself right at the beginning was to have an end to end test bench. Not just tests of the individual modules, but a test that exercised the whole system. In this case I wanted to be able to feed the controller a test image over the serial input, just like the RP2040 would in practice, and then decode the HUB75 outputs and reconstitute the image it was generating. Whilst not a foolproof test that the controller will drive the display in the correct way, since my assumptions about how the real HUB75 protocol operates might actually be incorrect, it would at least prove that the controller code operated in the way I assumed it did.
Another minor goal was to make the code bit-depth agnostic. That is I wanted it to work with different bit depths, not just 16 bit. Whilst externally the SPI master would always feed a 32 bit deep framebuffer, I wanted the FPGA to support different reductions (or no reduction) in that bit depth, depending on the block RAM available and perhaps the size of the panel. This was achieved by using Verilog parameters. Completely arbitrary bit depths are impossible to support, but the code is usually built for 16 bits per pixel, with 8 bits per pixel occasionally tested. This would allow for bigger panels to be used, but only 4 levels of intensity would be available using an iCE40UP. To keep the explanation of the code simpler this feature of the code will be ignored and 16 bits per pixel will be assumed.
In terms of the serial protocol design, it is fairly simple. Pixels are clocked in, 32 bits at a time, with the end of frame marked by toggling the badly named ~CTRL_SS signal. The receiving and currently displayed frames will then be swapped. That is, the frame just received will start to be read by the HUB75 state machine and the next frame will be clocked in on the SPI bus to where the previously displayed frame was stored. An entire frame is 64Kbit, so assuming a 10MHz clock (a reasonable starting point for the Pico W SPI hardware and easily within the capabilities of the FPGA) yields around 150 frames a second, which is easily sufficient. Note that this is not the HUB75 display update rate, but the rate that the Pico W can send new frames to display, the HUB75 frame rate being independent of the rate it is receiving frames from the controller.
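The frame rate figure can be checked with a quick calculation. This is a back-of-envelope sketch which ignores any gap between frames while ~CTRL_SS is toggled, so treat the result as an upper bound.

```python
FRAME_BITS = 64 * 32 * 32  # 2048 pixels, 32 bits each = 65,536 bits (64Kbit)


def frames_per_second(spi_clock_hz):
    """Upper bound on frames per second for a given serial clock."""
    return spi_clock_hz / FRAME_BITS


print(round(frames_per_second(10_000_000), 1))  # 152.6
```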
~CTRL_SS is directly wired into the top bit of the memory address. The complement of this signal is used for the read operation. This logic is encapsulated in the sync_pdp_ram module.
The modules used in the design are as follows:
controller
This is the outer module and incorrectly also contains the state machine for driving the HUB75 signals. At some point I may refactor this module to move the HUB75 state machine, the guts of the controller really, into its own module.
With that in mind you can get a rough idea of the overall structure by looking at this RTL diagram:
(This diagram was generated by feeding the Verilog code through Quartus and uses its RTL Viewer feature. Bear in mind that Quartus obviously does not support Lattice FPGAs, and an Intel/Altera part was selected in order to generate a Quartus project. This may slightly distort the RTL representation compared to this code running on a Lattice part. Or it may make no difference and the RTL Viewer may purely be a Verilog interpretation tool. Someone with more knowledge than me will know which is correct and I’d certainly appreciate them enlightening me in the comments.)
The clk signal at the top left is the “reader” clock, i.e. it drives the display. clk_counter[1..0] is a simple two bit counter. This divides the read clock for the display by 4. With an input frequency of 50MHz, which is what my board has as its external clock source, this gives a 12.5MHz read clock. Since PWM requires multiple passes through each line it is not completely trivial to obtain a frame rate from this read clock. PWM will be described in some detail later on.
spi_slave
This is the module which receives the SPI signals from the Pico W board. It does the serial to parallel conversion, whilst also generating a write_clk output which is used to advance the write address via the write_addr counter. This clock is flipped half way through reading the 32 bits of pixel data. Its other job is to truncate the 32 bit, 4 by 8 bit input, which is the Pico W’s representation of a pixel, into the 16 bit, 4 by 4 bit representation which is stored in the block RAM.
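The truncation step can be illustrated in Python. This is a sketch of the idea only: it assumes the top four bits of each 8 bit field are kept and that the four fields stay in the same order in the output word. The real Verilog may arrange the bits differently, and `truncate_pixel` is a name made up for this example.

```python
def truncate_pixel(word32):
    """Reduce a 4 x 8 bit pixel to 4 x 4 bits by keeping each top nibble."""
    out = 0
    for field in range(4):
        byte = (word32 >> (8 * field)) & 0xFF  # extract one 8 bit field
        out |= (byte >> 4) << (4 * field)      # keep its 4 most significant bits
    return out


print(hex(truncate_pixel(0xFF804000)))  # 0xf840
```

Dropping the low nibble means any channel value below 16 becomes zero, which is the loss of detail the end to end test shows in its output image.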
Unfortunately the RTL diagram produced by Quartus was not a terribly useful aid to explain the functioning of this module. It’s probably best to look at the code instead.
sync_pdp_ram
Short for synchronous pseudo dual port RAM this is the double buffered, pseudo dual port memory wrapper module at the heart of the controller. Thankfully Quartus did produce a nice diagram for this module:
At this point it would be helpful to explain the construction of the two memory addresses in the system, the read address and the write address.
First of all, consider that 6 bits are required to hold a column, 0 to 63. 5 bits are required for a row, 0 to 31. 6 plus 5 makes 11 bits.
In the context of the external controller module, the write address is 11 bits wide. This is sufficient bits for 1 plane of 64×32 pixels, which is 2048 words. When writing, which occurs from the Pico W’s perspective from the top left of the display to the bottom right, the top bit, bit 10 of the write address, is used to determine whether this is a bottom or top write. The ~CTRL_SS input (buffer_toggle in the context of sync_pdp_ram) forms the top bit of the write address at the two memory arrays.
Reads are simpler. Since top and bottom reads happen in parallel the external controller module holds only a 10 bit wide address. This is made up of a 4 bit (0 to 15) row address and a 6 bit column address, plus the inverse of the ~CTRL_SS external input.
This may be simpler to explain by looking at the code:
You can see how bit 10 of write_addr is used to select between top and bottom halves, and how the buffer_toggle input makes up the top bit of the final memory address.
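For readers who prefer a software model, the address construction described above can be sketched in Python. This is not the Verilog from the repository. The function names are hypothetical, and the sketch assumes that bit 10 set means the bottom half of the panel, which follows if the write address simply counts pixels from top left to bottom right.

```python
def write_target(write_addr, buffer_toggle):
    """Map an 11 bit write address to (memory array, address in that array)."""
    half = (write_addr >> 10) & 1         # 0 = top array, 1 = bottom (assumed)
    low = write_addr & 0x3FF              # 4 bit row within the half + 6 bit column
    return half, (buffer_toggle << 10) | low


def read_address(row, column, buffer_toggle):
    """Address presented to both arrays at once when reading."""
    low = ((row & 0xF) << 6) | (column & 0x3F)
    return ((buffer_toggle ^ 1) << 10) | low  # the reader uses the other buffer


# Row 16, column 0 is the first pixel of the bottom half.
print(write_target(16 * 64, buffer_toggle=1))  # (1, 1024)
print(read_address(15, 63, buffer_toggle=0))   # 2047
```

Because the top address bit is buffer_toggle on the write side and its inverse on the read side, the two sides can never be working in the same buffer.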
One interesting quirk of this code is the usage of a temporary synchronous signal which finds its way out combinatorially via the conditional assign statements. This may be excessive, the simpler approach of synchronously supplying the outputs directly perhaps being better. This is why there are flip flops on the data outputs, viewable at the right of the RTL diagram.
Next, a description of the state machine which drives the HUB75 pins, and how PWM is achieved.
To understand how to approach this I actually cheated a little: my starting point was to look at how the PIO mode, available in the Pico-SDK example repository, worked. Instead of attempting to read the code, I simply attached my trusty Saleae Logic 16 analyser to the HUB75 header and made some captures:
PWM is achieved by repeating a row 8 times, once for each weighted bit of colour output. By looking at the widths of the /OE pulses it is clear that the strobe doubles in length each time. The low order bits are outputted first and thus contribute least to the overall brightness because their /OE pulse is shortest.
For this FPGA implementation we only have 4 bits per red, green or blue. Also the PIO implementation is a little more complex than it has to be because the /OE pulse overlaps with the received data of the next row. The FPGA implementation I’ve settled on simplifies things by having the data being clocked in and the /OE pulse generation as separate “phases” of the processing.
In short, my equivalent waveform for clocking out one row four times, once for each bit at the requested intensity, looks like this:
A significant variable in this FSM is column_addr. This is used for two things: the low 6 bits count up the column position, from 0 to 63. All 10 bits are also used to count up the length of the /OE strobe.
Looking at the states required and their relationship, we end up with the following:
- READ_STATE_PIXELS:
  - Sets the HUB75 clock running flag to true, as in this state we need the HUB75 (external) pixel clock to run
  - Output the red, green and blue bits for the top and bottom half, using the data read from the sync_pdp_ram
  - If the column count (which is appended to the row address to form the read address) reaches 64 then we leave this state and enter READ_STATE_SET_LATCH_DELAY
- READ_STATE_SET_LATCH_DELAY:
  - The purpose of this state is just to insert a dummy clock before we latch the row
  - Clear the HUB75 clock running flag
  - Enter READ_STATE_SET_LATCH state
- READ_STATE_SET_LATCH:
  - Set the HUB75 latch signal high
  - Reset the column count to zero
  - Enter READ_STATE_OE_STROBE state
- READ_STATE_OE_STROBE:
  - Set the HUB75 latch signal low
  - Set the HUB75 /OE signal low
  - Increment the 10 bit column address
  - If the column address has reached the current end count for the /OE strobe, then
    - Set the /OE signal to high
    - Enter the READ_STATE_END_OF_ROW state
  - Otherwise stay in this state
- READ_STATE_END_OF_ROW:
  - Increment the bit count, which is used by READ_STATE_PIXELS to extract the 6 bits we are interested in for the top and bottom HUB75 red, green and blue signals
  - Double the size of the /OE pulse by shifting the maximum value of the counter which is tested against in READ_STATE_OE_STROBE left one bit position
  - If we’ve output 4 bits then
    - Clear the bit count to zero
    - Set the length of the /OE strobe to the smallest value, experimentally determined to be 10’b0000011111
    - Increment the row address value, to start the next row; note that this will just wrap on the end of the frame
  - In any case, set the state to READ_STATE_NEXT_LINE
- READ_STATE_NEXT_LINE:
  - Set the column address to 0
  - Set the state to READ_STATE_PIXELS to go around again
The above is only intended to give you a feel for the logic. To properly understand it, it is necessary to read the code. The key point of the design is that the /OE strobe doubles in size for each bit position, giving weight (brightness) to the more significant bits of image data. A further interesting point is that the least significant bit will have a /OE strobe of 31 clocks, with the strobe length then doubling for each of the remaining bit positions. This is based on a 12.5MHz clock and gives the expected level of brightness across the 16 different intensities.
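The weighting can be modelled in a few lines of Python. This sketch takes the description above literally: the end count starts at 10’b0000011111 and is shifted left once per bit position. Whether the hardware ends up counting exactly these numbers of clocks depends on off-by-one details in the Verilog, so the figures are illustrative, and the function names are invented for the example.

```python
BASE_STROBE = 0b0000011111  # smallest /OE end count, 31
BITS = 4                    # bits per colour channel


def strobe_lengths():
    """End count for each bit position, least significant bit first."""
    return [BASE_STROBE << bit for bit in range(BITS)]


def on_time(value):
    """Clocks a pixel is lit during one row for a 4 bit channel value."""
    return sum(length
               for bit, length in enumerate(strobe_lengths())
               if (value >> bit) & 1)


print(strobe_lengths())                # [31, 62, 124, 248]
print([on_time(v) for v in (1, 8, 15)])  # [31, 248, 465]
```

In this model the lit time is simply 31 times the channel value, so the 16 intensity levels are evenly spaced.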
Writing the testbench that reconstitutes a test image was probably more work than writing the actual controller.
Two helper scripts were written, in Go.
One script, tools/image-to-raw/image-to-raw.go, turns a 64×32 24 bit BMP image into a text file, each line in the file being 4 (red, green, blue and a dummy) 8 bit numbers in hex, one line for each pixel in the image. This file is suitable for loading into Verilog code using $readmemh().
Another script, tools/unscaled-to-image/unscaled-to-image.go, is a bit more involved. This script will read the output of the Verilog test bench, which is a collection of intensities for each pixel, and generate a BMP file. Because of the nature of the PWM mechanism, there is no predefined intensity for a pixel when it is at the maximum brightness. Instead this script will read in the entire image data set and then scale the brightnesses such that the highest intensity (essentially the count of clocks with the pixel turned on) is scaled to the maximum of an 8 bit value, 255 in decimal.
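The scaling step can be sketched as follows. This is a Python illustration of the idea rather than the Go code from the repository; it assumes a single scale factor is applied across every value in the image, and `scale_to_8bit` is a made-up name.

```python
def scale_to_8bit(counts):
    """Scale raw on-time counts so that the largest becomes 255."""
    peak = max(counts)
    if peak == 0:
        return [0] * len(counts)  # an all-black image stays black
    # Integer arithmetic with rounding to the nearest value.
    return [(c * 255 + peak // 2) // peak for c in counts]


print(scale_to_8bit([0, 31, 248, 465]))  # [0, 17, 136, 255]
```

One consequence of scaling to the peak is that a dim test image comes out brighter than the original, so test images which contain at least one full intensity pixel are the easiest to compare by eye.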
In terms of the test bench itself, it is divided into two stages.
First it clocks in the input image data. This is fairly simple: the file is read into memory using $readmemh() and then fed, a bit at a time, into the SPI-like inputs on the controller. Two loops are used, one for the pixel count (64 x 32) and one for the bit count (32).
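The two nested loops amount to flattening the image into a stream of bits. Here is a Python sketch of that, assuming the most significant bit of each 32 bit pixel is sent first. The bit order is an assumption for the illustration, and the function names are hypothetical.

```python
def pixel_bits(word32):
    """The 32 bits of one pixel, most significant bit first (assumed)."""
    return [(word32 >> bit) & 1 for bit in range(31, -1, -1)]


def frame_bits(pixels):
    """Outer loop over pixels, inner loop over the bits of each pixel."""
    for word in pixels:
        yield from pixel_bits(word)


frame = [0x80000001] * (64 * 32)
bits = list(frame_bits(frame))
print(len(bits))   # 65536
print(bits[:4])    # [1, 0, 0, 0]
```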
To instruct the controller to start outputting this image on the HUB75 pins, the ~CTRL_SS input is moved high. Now the main clock can be toggled forever. The rest of the test bench operates using always blocks sensitive to various signals, typically the HUB75 clock output, which is cycled to clock out pixels by the controller.
Pulling in an individual HUB75 row is as simple as using 3 arrays, for red, green and blue and a counter. At this point we are just storing bits, not intensities. These 3 rows are copied into “latched” rows when the latch signal goes high, just as would happen inside the HUB75 PCB hardware latch ICs.
The last part of the test bench is to sum up these latched rows when the /OE signal stays low, using the row address output to determine the row currently being output. Once the testbench has seen that every row of the screen has been output, which is determined in a somewhat crass way by watching for the top row address bit going low, the intensity values for the entire screen are dumped out and the testbench is ended. In use, this output is written to a file for feeding into the unscaled-to-image.go script, and the resultant BMP file viewed.
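The decoding approach can be modelled in Python for a single colour channel and one half of the panel. This is a sketch of the technique, not a translation of the Verilog test bench: `PanelModel` and `show_row` are hypothetical, and `show_row` stands in for the controller using the strobe lengths discussed earlier.

```python
WIDTH = 64


class PanelModel:
    """Shift register, row latch and per pixel on-time accumulator."""

    def __init__(self, rows=16):
        self.shift = []                # bits clocked in for the current row
        self.latched = [0] * WIDTH     # what the LEDs show while /OE is low
        self.intensity = [[0] * WIDTH for _ in range(rows)]

    def clock(self, bit):
        self.shift.append(bit)         # HUB75 clock edge: store one bit

    def latch(self):
        self.latched = self.shift[-WIDTH:]  # latch high: copy the row
        self.shift = []

    def oe_low_tick(self, row):
        # One clock with /OE low: every lit pixel gains one count.
        for col, bit in enumerate(self.latched):
            self.intensity[row][col] += bit


def show_row(panel, row, values, base=31, bits=4):
    """Stand-in for the controller: one pass per bit, strobe doubling."""
    for bit in range(bits):
        for value in values:
            panel.clock((value >> bit) & 1)
        panel.latch()
        for _ in range(base << bit):
            panel.oe_low_tick(row)


panel = PanelModel()
values = [0] * WIDTH
values[3], values[63] = 5, 15
show_row(panel, 0, values)
print(panel.intensity[0][3], panel.intensity[0][63])  # 155 465
```

The accumulated counts are exactly the kind of unscaled intensities which the second helper script then scales into an 8 bit image.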
The following is a contrived screenshot of the testbench in use:
At top right is the input image and bottom right is the output image. They are not entirely identical, since the output will be truncated to a 12 bit image.
This took quite a lot of iteration to get right, but it was certainly a lot faster iterating a testbench than it would have been to upload the bitstream to the FPGA and test it each time with real hardware.
The controller has its own repository, which is imaginatively named hub75-controller. It should be fairly easy to adapt it to other size screens and other FPGAs, bearing in mind the memory constraints of your choice of FPGA.
The last thing I had to work on was hardware, that is good old fashioned wood working with some 3D printing thrown in.
The basic idea was to mount the HUB75 panel to some oak, with the wires for the panel going through holes in the wood. Attached to the oak piece would be a 3D printed plastic case, with the PCB screwed into that.
I decided I would dust off my CAD skills and give it a go to model both pieces in Autodesk Fusion 360:
If you look closely you can see the PCB, with Pico W, inside the case. To ensure that everything would line up correctly I exported the 3D model inside KiCAD as a STEP file, and imported it into Fusion.
The round hole in the plastic is for the power jack, the other hole is for the micro USB connector on the Pico W board. The plastic case is attached to the wooden piece using machine screws going into inserts. I didn’t want to use wood screws as I’d have to remove the case to gain access to the PCB for occasional maintenance.
Making the plastic case was fairly trivial; it’s just a large 3D print. It took most of the day on my trusty (or should that be crusty now) Ender 3. This is an earlier iteration of the design, which wasn’t quite tall enough and required some other tweaks:
The woodwork was a lot more involved. I had to “re-commission” my old MPCNC CNC machine, which involved printing a few replacement parts, tightening the belts, and other tweaking.
There are a grand total of four operations required to machine out the part:
Adaptive clearing to step down the three slopes, a boring operation to cut out the screw holes, and a pair of 2D contour operations, one to cut out the 3 holes for the HUB75 cables and another to cut out the finished workpiece. Here’s a picture of the CNC machine doing its thing:
At this point the job was very nearly complete. I did have a few issues with the depth of cut not quite being sufficient, but this was easily fixed by operating the spindle “by hand”.
I decided not to sand off the very small steps on the angles as they are a reminder of how the part was made.
The last job was to put the whole thing together:
- First the PCB was attached to the inside of the plastic case. The leads for the HUB75 were attached in advance of being threaded through the holes in the wooden plate piece.
- Then the brass inserts were screwed down into smaller pre-drilled holes in the wooden piece.
- Then the plastic back cover was attached and screwed into the inserts using four machine screws.
- The HUB75 power and data cables were attached to the HUB75 panel.
- The HUB75 panel itself was then attached using more machine screws, passing through the holes in the wood.
This picture shows an addition to the system I’ve not discussed here yet in the form of an I2C sensor board for temperature, humidity and air pressure. This is at the top right of the case in this picture. I’ll go over the details of this, as it’s pretty cool I think, in a future post.
Here’s a picture of the finished back of the unit, so you can see how it goes together. I’m pretty pleased with the combination of wood and plastic parts. The wood gives the display stability – it’s pretty heavy – and it looks nice too.
I’ll end with a re-link to my YouTube video, in case you missed it first time around:
If you got this far, thank you for reading! This must be my longest blog post by some margin.
The next post will be back onto the subject of FPGAs and softcores, I feel…