Introduction
This is not the reference design but my reference design. It includes the parts that I find
important for my own tests; others will certainly consider different parts more important.
That is the nice thing about the NEORV32 project: everyone can put their system together
the way they need it.
And why the DE0-Nano and not the XYZ board? Compared to my other boards, the DE0-Nano is
the cheapest, and the second connector remains easy to use while the JTAG adapter is
plugged into the other one.
I use Quartus II 64-bit 15.0.2 and SEGGER Embedded Studio for RISC-V as the development
environment. NEORV32 version v1.8.9.6 was used.
Hardware
A DE0-Nano board with the "JTAG Terasic Adapter" and SPI flash was used.
The DE0-Nano provides a lot of functionality, but only the following subset was used:
- Cyclone IV (EP4CE22F17C6)
- Built-in USB Blaster
- 32 MByte SDRAM
- 2 Push-button switches
- 8 Green User LEDs
- 50 MHz oscillator
The following options of the NEORV32 have been enabled:
- Bootloader
- On-Chip Debugger
- Internal Instruction memory
- Internal Data memory
- Internal Instruction Cache (ICACHE)
- External memory interface
- GPIO
- TRNG
- UART0
- Machine System Timer
- General Purpose Timer
Address Map
The settings on the NEORV32 result in the following address map:
Peripheral | Address Offset | Size (bytes) | Attribute |
On-Chip Debugger | 0xFFFFFF00 | 256 | OCD address space |
SYSINFO | 0xFFFFFE00 | 256 | System Information Memory |
GPIO | 0xFFFFFC00 | 256 | General Purpose Input / Output |
TRNG | 0xFFFFFA00 | 256 | True Random Number Generator |
UART0 | 0xFFFFF500 | 256 | Primary UART |
MTIME | 0xFFFFF400 | 256 | Machine System Timer |
GPTMR | 0xFFFFF100 | 256 | General Purpose Timer |
Bootloader ROM | 0xFFFFC000 | 8K | Bootloader address space |
SDRAM | 0x90000000 | 32M | External SDRAM |
DMEM | 0x80000000 | 16K | On-chip Data Memory |
IMEM | 0x00000000 | 32K | On-chip Instruction Memory |
ICACHE | - | 16K | On-chip Instruction Cache |
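The address map can also be written down as data, which makes it easy to check that no two regions overlap and to see which region a given address belongs to. The following is only a small sketch: the helper names `find_overlaps` and `region_of` are made up for this illustration, and it assumes that K means 1024 bytes and M means 1024 x 1024 bytes. The ICACHE is left out because it has no address of its own.

```python
# Address map from the table above: (name, base address, size in bytes).
# Assumption: "K" = 1024 bytes and "M" = 1024 * 1024 bytes.
KIB = 1024
MIB = 1024 * 1024

REGIONS = [
    ("On-Chip Debugger", 0xFFFFFF00, 256),
    ("SYSINFO",          0xFFFFFE00, 256),
    ("GPIO",             0xFFFFFC00, 256),
    ("TRNG",             0xFFFFFA00, 256),
    ("UART0",            0xFFFFF500, 256),
    ("MTIME",            0xFFFFF400, 256),
    ("GPTMR",            0xFFFFF100, 256),
    ("Bootloader ROM",   0xFFFFC000, 8 * KIB),
    ("SDRAM",            0x90000000, 32 * MIB),
    ("DMEM",             0x80000000, 16 * KIB),
    ("IMEM",             0x00000000, 32 * KIB),
]


def find_overlaps(regions):
    """Return pairs of neighbouring regions (sorted by base) that overlap."""
    ordered = sorted(regions, key=lambda region: region[1])
    overlaps = []
    for (name_a, base_a, size_a), (name_b, base_b, _) in zip(ordered, ordered[1:]):
        if base_a + size_a > base_b:
            overlaps.append((name_a, name_b))
    return overlaps


def region_of(address, regions=REGIONS):
    """Return the name of the region containing address, or None."""
    for name, base, size in regions:
        if base <= address < base + size:
            return name
    return None


if __name__ == "__main__":
    print("overlaps:", find_overlaps(REGIONS))
    for address in (0x00000000, 0x80003FFF, 0x90000000, 0xFFFFF500):
        print(f"0x{address:08X} -> {region_of(address)}")
```

With the sizes from the table, the SDRAM region ends at 0x91FFFFFF, so an address such as 0x92000000 falls outside every region.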
J-Flash SPI
Unfortunately, the J-Link EDU does not include the license for J-Flash SPI.
If you have access to a J-Link with the right license, there is a project in the download
area (de0n-spi) that connects the SPI Flash of the "JTAG Terasic Adapter" directly
to the JTAG.
TCM / SDRAM
The examples can be executed from TCM or from SDRAM. Debugging may only work from
SDRAM, because the example could be too large for the TCM.
Instruction Cache (ICACHE)
The Instruction Cache is configured with the following three parameters:
- ICACHE_NUM_BLOCKS
- ICACHE_BLOCK_SIZE
- ICACHE_ASSOCIATIVITY
At the beginning I had no idea what the optimal values for these parameters were. That's
why I tried to determine the settings experimentally, using the following
three benchmarks:
- Dhrystone
- CoreMark
- Crypto
The optimal settings of course depend on the application you are running, so the settings
for application A can differ from those for B or C. At first I only changed the
parameters NUM_BLOCKS and BLOCK_SIZE. Later I took a closer look at the ASSOCIATIVITY
parameter as well.
The idea was to first run the benchmark from TCM without the ICACHE in order to
determine the best achievable performance value. Then the benchmark was run from SDRAM,
both without and with the ICACHE, to find the ICACHE settings that give the best
performance.
I started with the default settings 4x64 (NUM_BLOCKS x BLOCK_SIZE) and changed first the
BLOCK_SIZE and then NUM_BLOCKS.
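To compare the settings it helps to convert the short notation into a total cache size. The sketch below assumes that a label such as 4x64 means NUM_BLOCKS x BLOCK_SIZE with BLOCK_SIZE in bytes, and that an optional third factor (as in 128x64x2) is the ASSOCIATIVITY and multiplies the size. The second assumption is my reading of the notation; it is consistent with the 2K and 16K totals mentioned in this article. The function names are made up for this example.

```python
def parse_config(label):
    """Split a label like '8x256' or '128x64x2' into its parameters.

    Assumption: the order is NUM_BLOCKS x BLOCK_SIZE [x ASSOCIATIVITY],
    with BLOCK_SIZE in bytes and ASSOCIATIVITY defaulting to 1.
    """
    parts = [int(part) for part in label.split("x")]
    num_blocks, block_size = parts[0], parts[1]
    associativity = parts[2] if len(parts) > 2 else 1
    return num_blocks, block_size, associativity


def cache_size_bytes(label):
    """Total cache size in bytes under the assumption stated above."""
    num_blocks, block_size, associativity = parse_config(label)
    return num_blocks * block_size * associativity


if __name__ == "__main__":
    for label in ("4x64", "8x256", "16x256", "16x1024", "256x64", "128x64x2"):
        print(f"{label:>9}: {cache_size_bytes(label):6d} bytes")
```

Under these assumptions the default 4x64 is a cache of only 256 bytes, 8x256 is 2K, and 16x1024, 256x64 and 128x64x2 are all 16K.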
The benchmarks were compiled in release mode with "Link Time Optimization"
and "Level 3 for more speed".
ICACHE and Dhrystone
Running from TCM gives a value of 0.82 DMIPS/MHz. Running from SDRAM without the
ICACHE gives only 0.21 DMIPS/MHz. Here is an overview of the values (in DMIPS/MHz)
with the ICACHE:
- 4x64 => 0.17
- 8x64 => 0.23
- 4x128 => 0.20
- 8x128 => 0.27
- 4x256 => 0.20
- 8x256 => 0.44
- 4x512 => 0.13
- 8x512 => 0.13
- 16x256 => 0.44
The optimal setting here appears to be 8x256, which gives a total cache size of 2K.
A further increase to 16x256 brings no additional performance.
With the ICACHE, code running from SDRAM reaches a little more than half of the
performance of code running from TCM.
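The Dhrystone numbers above can be put in relation to the two reference points, TCM and SDRAM without ICACHE. The following sketch only re-uses the values listed in this section; the variable names are chosen for this example.

```python
# Dhrystone results from this section, in DMIPS/MHz.
TCM = 0.82             # running from TCM
SDRAM_NO_CACHE = 0.21  # running from SDRAM without ICACHE

DHRYSTONE = {
    "4x64": 0.17,
    "8x64": 0.23,
    "4x128": 0.20,
    "8x128": 0.27,
    "4x256": 0.20,
    "8x256": 0.44,
    "4x512": 0.13,
    "8x512": 0.13,
    "16x256": 0.44,
}

best = max(DHRYSTONE.values())

# Settings that are slower than running from SDRAM without any cache.
slower_than_uncached = [
    label for label, score in DHRYSTONE.items() if score < SDRAM_NO_CACHE
]

if __name__ == "__main__":
    for label, score in DHRYSTONE.items():
        print(
            f"{label:>7}: {score:.2f} DMIPS/MHz, "
            f"{score / TCM:5.1%} of TCM, "
            f"{score / SDRAM_NO_CACHE:.2f}x uncached SDRAM"
        )
    print("slower than uncached SDRAM:", slower_than_uncached)
```

The best value of 0.44 is about 54% of the TCM value. The list also shows that several settings, including the default 4x64, are slower than running from SDRAM without the ICACHE.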
ICACHE and CoreMark
For CoreMark, the "Iterations/Sec" result has to be divided by the CPU clock in MHz
(here 100, because the CPU runs at 100 MHz). Only then do you get a value in CoreMark/MHz.
Running from TCM gives a value of 0.95 CoreMark/MHz. Running from SDRAM without the
ICACHE gives only 0.22 CoreMark/MHz. Here is an overview of the values (in CoreMark/MHz)
with the ICACHE:
- 4x64 => 0.26
- 8x64 => 0.39
- 4x128 => 0.31
- 8x128 => 0.52
- 4x256 => 0.44
- 8x256 => 0.54
- 4x512 => 0.52
- 8x512 => 0.54
- 16x256 => 0.55
The optimal setting here appears to be 8x256, which gives a total cache size of 2K.
A further increase to 16x256 does not bring any noticeable increase in performance.
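The conversion to CoreMark/MHz and the choice of 8x256 can be reproduced with a few lines. This is only a sketch: the function names and the tolerance of 0.02 CoreMark/MHz for "no noticeable increase" are my own assumptions, and the 95 iterations per second in the example is back-calculated from the 0.95 CoreMark/MHz TCM value, not a separately measured number.

```python
CPU_CLOCK_MHZ = 100  # the CPU runs at 100 MHz


def coremark_per_mhz(iterations_per_sec, clock_mhz=CPU_CLOCK_MHZ):
    """Normalize the CoreMark 'Iterations/Sec' result to CoreMark/MHz."""
    return iterations_per_sec / clock_mhz


# CoreMark results from this section, in CoreMark/MHz.
COREMARK = {
    "4x64": 0.26,
    "8x64": 0.39,
    "4x128": 0.31,
    "8x128": 0.52,
    "4x256": 0.44,
    "8x256": 0.54,
    "4x512": 0.52,
    "8x512": 0.54,
    "16x256": 0.55,
}


def size_bytes(label):
    """Cache size for a NUM_BLOCKS x BLOCK_SIZE label such as '8x256'."""
    num_blocks, block_size = (int(part) for part in label.split("x"))
    return num_blocks * block_size


def smallest_near_best(results, tolerance=0.02):
    """Smallest cache whose score is within tolerance of the best score.

    The tolerance is a hypothetical threshold chosen for this example.
    """
    best = max(results.values())
    candidates = [
        label for label, score in results.items() if best - score <= tolerance
    ]
    return min(candidates, key=size_bytes)


if __name__ == "__main__":
    print("95 iterations/s ->", coremark_per_mhz(95), "CoreMark/MHz")
    print("smallest setting near the best score:", smallest_near_best(COREMARK))
```

With this tolerance, 8x256, 8x512 and 16x256 all count as equally good, and 8x256 is the smallest of them.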
ICACHE and Crypto
This test consists of a hash and an ECDSA benchmark. Because of the size of the benchmark,
values are only available for SDRAM, without and with the cache.
Without the cache, the values look like this:
And with the cache, the values look like this:
With the ICACHE you can double the performance when running from SDRAM.
However, a cache with a size of 16K was used here. The problem was finding the right
settings. First I tried a setting of 16x1024, then reduced the BLOCK_SIZE and increased
NUM_BLOCKS step by step, so the values 16x1024, 32x512, 64x256 and so on up to 256x64 were used.
It turned out that 256x64 was the optimal setting.
In the next step, I tested whether even better values could be achieved with ASSOCIATIVITY.
It turned out that 128x64x2 improved the values even more.
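One possible explanation for the gain from ASSOCIATIVITY, which the measurements here do not prove, is fewer conflict misses: in a direct-mapped cache, two code addresses that are exactly one cache size apart compete for the same block. The sketch below illustrates this with a generic textbook model with LRU replacement. It is not the NEORV32 implementation, the function `count_misses` and the two addresses are made up for this example, and it assumes that 128x64x2 means 128 blocks per way, 64-byte blocks and 2 ways.

```python
from collections import OrderedDict


def count_misses(addresses, blocks_per_way, block_size, ways):
    """Count misses in a textbook set-associative cache with LRU replacement.

    Generic model for illustration only, not the NEORV32 ICACHE.
    """
    sets = [OrderedDict() for _ in range(blocks_per_way)]
    misses = 0
    for address in addresses:
        block = address // block_size
        index = block % blocks_per_way  # which set the block maps to
        tag = block // blocks_per_way   # identifies the block within the set
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


# Hypothetical access pattern: code alternating between two SDRAM
# addresses that are exactly 16K apart.
TRACE = [0x90000000, 0x90004000] * 4

if __name__ == "__main__":
    print("256x64 direct mapped:", count_misses(TRACE, 256, 64, 1), "misses")
    print("128x64x2 two-way:    ", count_misses(TRACE, 128, 64, 2), "misses")
```

In this model every one of the 8 accesses misses in the direct-mapped 256x64 cache, while the two-way 128x64x2 cache only misses on the first access to each address. Real benchmark code is of course far less regular than this trace.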
ICACHE Conclusion
After these tests I would say that you should use a value of 64 for the BLOCK_SIZE and
make NUM_BLOCKS as large as possible. Furthermore, the performance can possibly be
improved again with ASSOCIATIVITY.
Whether you need a cache of 2K or 16K depends on your application, and you have to test it yourself.
Download
- Quartus de0n-spi_20211229 project for the direct JTAG to SPI connection (16 KB)
- de0n-spi.sof, which simply has to be loaded into the FPGA with the programmer (687 KB)
The repository for this project can be found on GitHub at neorv32-de0n-ref.