DE0-Nano reference
Introduction
Hardware
Address Map
J-Flash SPI
TCM / SDRAM memory
Instruction Cache (ICACHE)
ICACHE and Dhrystone
ICACHE and CoreMark
ICACHE and Crypto
ICACHE Conclusion
Download
 

Introduction

This is not the reference but my reference: it covers the parts that I find important for my tests here. Others will certainly find other parts more important. But that is the nice thing about the NEORV32 project: everyone can put their system together as they need it.

And why the DE0-Nano and not the XYZ board? Of my boards, the DE0-Nano is the cheapest, and when the JTAG adapter occupies one connector, the second connector is still easy to use.

I use Quartus II 64-bit 15.0.2 and SEGGER Embedded Studio for RISC-V as the development environment here. The NEORV32 was used in version v1.8.9.6.

Hardware

A DE0-Nano board with the "JTAG Terasic Adapter" and SPI Flash was used here.

The DE0-Nano provides a lot of functionality, but only the following subset was used:

  • Cyclone IV (EP4CE22F17C6)
  • Built-in USB-Blaster
  • 32 MByte SDRAM
  • 2 Push-button switches
  • 8 Green User LEDs
  • 50 MHz oscillator

The following options of the NEORV32 have been enabled:

  • Bootloader
  • On-Chip Debugger
  • Internal Instruction memory
  • Internal Data memory
  • Internal Instruction Cache (ICACHE)
  • External memory interface
  • GPIO
  • TRNG
  • UART0
  • Machine System Timer
  • General Purpose Timer

Address Map

The settings on the NEORV32 result in the following address map:

 Peripheral        Base Address  Size (bytes)  Attribute
 On-Chip Debugger  0xFFFFFF00    256           OCD address space
 SYSINFO           0xFFFFFE00    256           System Information Memory
 GPIO              0xFFFFFC00    256           General Purpose Input / Output
 TRNG              0xFFFFFA00    256           True Random Number Generator
 UART0             0xFFFFF500    256           Primary UART
 MTIME             0xFFFFF400    256           Machine System Timer
 GPTMR             0xFFFFF100    256           General Purpose Timer
 Bootloader ROM    0xFFFFC000    8K            Bootloader address space
 SDRAM             0x90000000    32M           External SDRAM
 DMEM              0x80000000    16K           On-chip Data Memory
 IMEM              0x00000000    32K           On-chip Instruction Memory
 ICACHE            -             16K           On-chip Instruction Cache
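
To make the address map easier to check, here is a minimal Python sketch that looks up which region an address belongs to. The base addresses and sizes are copied from the table above. The names `REGIONS` and `find_region` are only illustrative and are not part of the NEORV32 software. The ICACHE is left out because the table lists no address for it.

```python
# Regions copied from the address map above: (name, base address, size in bytes).
# ICACHE is omitted because the table gives no address for it.
KIB = 1024
MIB = 1024 * KIB

REGIONS = [
    ("IMEM",             0x00000000, 32 * KIB),
    ("DMEM",             0x80000000, 16 * KIB),
    ("SDRAM",            0x90000000, 32 * MIB),
    ("Bootloader ROM",   0xFFFFC000, 8 * KIB),
    ("GPTMR",            0xFFFFF100, 256),
    ("MTIME",            0xFFFFF400, 256),
    ("UART0",            0xFFFFF500, 256),
    ("TRNG",             0xFFFFFA00, 256),
    ("GPIO",             0xFFFFFC00, 256),
    ("SYSINFO",          0xFFFFFE00, 256),
    ("On-Chip Debugger", 0xFFFFFF00, 256),
]

def find_region(address):
    """Return the name of the region that contains address, or None."""
    for name, base, size in REGIONS:
        if base <= address < base + size:
            return name
    return None

print(find_region(0x90001000))  # SDRAM
print(find_region(0xFFFFF504))  # UART0
print(find_region(0x00008000))  # None (first byte after the 32K IMEM)
```

A lookup like this is handy when deciding whether a linker section or a pointer from the debugger ends up in TCM or in SDRAM.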

J-Flash SPI

Unfortunately, the J-Link EDU lacks the license for J-Flash SPI. If you have access to a J-Link with the right license, there is a project in the download area (de0n-spi) that connects the SPI Flash of the "JTAG Terasic Adapter" directly to the JTAG.

TCM / SDRAM

The examples can be executed from TCM and from SDRAM memory. Debugging may only work in SDRAM, because an example can be too large for the TCM memory.

Instruction Cache (ICACHE)

The Instruction Cache can be set with the following 3 parameters:

  • ICACHE_NUM_BLOCKS
  • ICACHE_BLOCK_SIZE
  • ICACHE_ASSOCIATIVITY

At the beginning I had no idea what the optimal values are for this. That’s why I tried to determine the settings experimentally. For this I used the following 3 benchmarks:

  • Dhrystone
  • CoreMark
  • Crypto

The optimal settings are of course dependent on the application you are currently using. So the settings for A can be different than for B or C. At the beginning I only changed the parameters NUM_BLOCKS and BLOCK_SIZE. Later I took a closer look at the ASSOCIATIVITY parameter too.

The idea was to first run the benchmark in TCM memory without the ICACHE in order to determine the best achievable performance. Then the benchmark was run in SDRAM, first without and then with the ICACHE, to find the ICACHE settings that give the best performance.

I started with the default settings 4x64 (NUM_BLOCKS x BLOCK_SIZE), then first changed BLOCK_SIZE and then NUM_BLOCKS.

The benchmarks were compiled in release mode with "Link Time Optimization" and "Level 3 for more speed".

ICACHE and Dhrystone

Running in TCM memory gives a value of 0.82 DMIPS/MHz. Running in SDRAM without the ICACHE gives only 0.21 DMIPS/MHz. Here is an overview of the values with the ICACHE:

  •   4x64   => 0.17
  •   8x64   => 0.23
  •   4x128 => 0.20
  •   8x128 => 0.27
  •   4x256 => 0.20
  •   8x256 => 0.44
  •   4x512 => 0.13
  •   8x512 => 0.13
  • 16x256 => 0.44

It looks like the optimal setting here is 8x256, which gives a total cache size of 2K. A further increase to 16x256 does not bring any more performance.
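
The selection above can be reproduced with a short Python sketch. It assumes that the NxM notation means NUM_BLOCKS x BLOCK_SIZE with the block size in bytes, which matches the 2K total stated for 8x256. The values are copied from the list. The helper names `cache_bytes` and `smallest_best` are only illustrative.

```python
# DMIPS/MHz values copied from the list above, keyed by (NUM_BLOCKS, BLOCK_SIZE).
DHRYSTONE = {
    (4, 64): 0.17, (8, 64): 0.23,
    (4, 128): 0.20, (8, 128): 0.27,
    (4, 256): 0.20, (8, 256): 0.44,
    (4, 512): 0.13, (8, 512): 0.13,
    (16, 256): 0.44,
}

def cache_bytes(num_blocks, block_size):
    """Total cache size in bytes (assumption: direct-mapped, size in bytes)."""
    return num_blocks * block_size

def smallest_best(results):
    """Return the smallest configuration that reaches the best score."""
    best = max(results.values())
    candidates = [cfg for cfg, score in results.items() if score == best]
    return min(candidates, key=lambda cfg: cache_bytes(*cfg))

blocks, size = smallest_best(DHRYSTONE)
print(blocks, size, cache_bytes(blocks, size))  # 8 256 2048
```

Picking the smallest configuration among the best ones is a design choice: 16x256 reaches the same score but costs twice the block RAM.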

With the ICACHE, a program running in SDRAM achieves a little more than half of the TCM performance here.

ICACHE and CoreMark

In the case of CoreMark, the "Iterations/Sec" result has to be divided by the 100 MHz at which the CPU is running. Only then do you get a value for CoreMark/MHz.
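
As a small sketch of this conversion in Python: the function name `coremark_per_mhz` is only illustrative, and the input of 95.0 Iterations/Sec is not a measured log value but back-calculated from the 0.95 CoreMark/MHz TCM result reported in this section.

```python
CPU_CLOCK_MHZ = 100  # CPU clock as stated in the text

def coremark_per_mhz(iterations_per_sec, clock_mhz=CPU_CLOCK_MHZ):
    """Convert the CoreMark "Iterations/Sec" result to CoreMark/MHz."""
    return iterations_per_sec / clock_mhz

# 95.0 Iterations/Sec is back-calculated from the TCM result, not a log value.
print(round(coremark_per_mhz(95.0), 2))  # 0.95
```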

Running in TCM memory gives a value of 0.95 CoreMark/MHz. Running in SDRAM without the ICACHE gives only 0.22 CoreMark/MHz. Here is an overview of the values with the ICACHE:

  •   4x64   => 0.26
  •   8x64   => 0.39
  •   4x128 => 0.31
  •   8x128 => 0.52
  •   4x256 => 0.44
  •   8x256 => 0.54
  •   4x512 => 0.52
  •   8x512 => 0.54
  • 16x256 => 0.55

It looks like the optimal setting here is again 8x256, which gives a total cache size of 2K. A further increase to 16x256 does not bring any noticeable increase in performance.
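
To put the CoreMark numbers in relation, the following Python sketch computes the speedup over uncached SDRAM and the fraction of the TCM performance for the 8x256 setting. All three values are taken from this section. The helper `ratio` is only illustrative.

```python
# CoreMark/MHz values copied from the text and the list above.
TCM = 0.95
SDRAM_NO_CACHE = 0.22
SDRAM_ICACHE_8X256 = 0.54

def ratio(value, reference):
    """Return value relative to reference."""
    return value / reference

print(f"speedup over uncached SDRAM: {ratio(SDRAM_ICACHE_8X256, SDRAM_NO_CACHE):.2f}x")
print(f"fraction of TCM performance: {ratio(SDRAM_ICACHE_8X256, TCM):.0%}")
```

With these values the 2K cache gives roughly 2.45 times the uncached SDRAM performance, which is about 57% of the TCM value.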

ICACHE and Crypto

This test consists of a hash and an ECDSA benchmark. Because of the size of the benchmark, there are only values for the SDRAM, without and with the cache.

Without the cache, the values look like this:

And with the cache, the values look like this:

With the ICACHE you can double the performance here in the SDRAM.

Note, however, that a cache with a size of 16K was used here. The problem was finding the right settings. First I tried a setting of 16x1024, then reduced the BLOCK_SIZE while increasing the NUM_BLOCKS. So values from 16x1024, 32x512, 64x256 up to 256x64 were used. It turned out that 256x64 was the optimal setting.
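
The sweep described above keeps the total size fixed at 16K and halves the block size at each step. A minimal Python sketch of that enumeration follows. It assumes the NxM notation means NUM_BLOCKS x BLOCK_SIZE in bytes, and the function name `sweep` is only illustrative.

```python
CACHE_BYTES = 16 * 1024  # total cache size used for the crypto benchmark

def sweep(total_bytes, largest_block, smallest_block):
    """List (NUM_BLOCKS, BLOCK_SIZE) pairs with a constant total size."""
    configs = []
    block_size = largest_block
    while block_size >= smallest_block:
        configs.append((total_bytes // block_size, block_size))
        block_size //= 2  # halve the block size, which doubles the block count
    return configs

for num_blocks, block_size in sweep(CACHE_BYTES, 1024, 64):
    print(f"{num_blocks}x{block_size}")
# prints 16x1024, 32x512, 64x256, 128x128, 256x64 (one per line)
```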

In the next step, I tested whether you could get even better values with ASSOCIATIVITY. It turned out that with 128x64x2 the values could be improved even more.

ICACHE Conclusion

After these tests I would say that you should use a value of 64 for the BLOCK_SIZE and make NUM_BLOCKS as large as possible. Furthermore, the performance can possibly be improved again with ASSOCIATIVITY.

Whether you need a cache of 2K or 16K depends on your application and you have to test it yourself.

Download

Quartus de0n-spi_20211229 project for the direct JTAG to SPI connection (16 KB)

de0n-spi.sof which simply has to be loaded into the FPGA with the programmer (687 KB)

The repository for this project can be found on GitHub at neorv32-de0n-ref.