Introduction
This is not the reference but my reference. That means it includes the parts that I find
important for my tests. Others will certainly find other parts more important. But
that is the nice thing about the NEORV32 project: everyone can put their system together
as they need it.
And why the DE0-Nano and not the XYZ board? Compared to my other boards, the DE0-Nano is
the cheapest, and the second connector stays easy to use when the JTAG adapter
is plugged into the other one.
I use Quartus Prime Lite version 20.1.1 and SEGGER Embedded Studio v8.24 as the
development environment here. NEORV32 version v1.12.3.3 was used.
Hardware
A DE0-Nano board with the "JTAG Terasic Adapter" and SPI flash was used.
The DE0-Nano provides a lot of functionality, but only the following subset was used:
- Cyclone IV (EP4CE22F17C6)
- Built-in USB-Blaster
- 32 MByte SDRAM
- 2 Push-button switches
- 8 Green User LEDs
- 50 MHz oscillator
The following options of the NEORV32 have been enabled:
- Bootloader
- On-Chip Debugger (DM)
- Internal Instruction memory
- Internal Data memory
- Internal Instruction Cache (ICACHE)
- External memory interface
- GPIO
- WDT
- TRNG
- UART0
- Core Local Interruptor
- General Purpose Timer
Address Map
The settings on the NEORV32 result in the following address map:
| Peripheral | Address Offset | Size (bytes) | Attribute |
| DM | 0xFFFF0000 | 64K | OCD address space |
| SYSINFO | 0xFFFE0000 | 64K | System Information Memory |
| GPIO | 0xFFFC0000 | 64K | General Purpose Input / Output |
| WDT | 0xFFFB0000 | 64K | Watchdog Timer |
| TRNG | 0xFFFA0000 | 64K | True Random Number Generator |
| UART0 | 0xFFF50000 | 64K | Primary UART |
| CLINT | 0xFFF40000 | 64K | Core Local Interruptor |
| GPTMR | 0xFFF10000 | 64K | General Purpose Timer |
| Bootloader ROM | 0xFFE00000 | 64K | Bootloader address space |
| SDRAM | 0x90000000 | 32M | External SDRAM |
| DMEM | 0x80000000 | 16K | On-chip Data Memory |
| IMEM | 0x00000000 | 32K | On-chip Instruction Memory |
| ICACHE | - | 8K | On-chip Instruction Cache |
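As a quick sanity check, the address map can be written down as a small Python table and queried. This is only a sketch: the names `ADDRESS_MAP`, `find_region` and `regions_overlap` are made up for illustration and are not part of the NEORV32 software framework. The ICACHE is left out because it has no address of its own.

```python
# Address map from the table above: (name, start address, size in bytes).
# All names in this sketch are hypothetical helpers, not NEORV32 APIs.
KIB = 1024
MIB = 1024 * KIB

ADDRESS_MAP = [
    ("DM",             0xFFFF0000, 64 * KIB),
    ("SYSINFO",        0xFFFE0000, 64 * KIB),
    ("GPIO",           0xFFFC0000, 64 * KIB),
    ("WDT",            0xFFFB0000, 64 * KIB),
    ("TRNG",           0xFFFA0000, 64 * KIB),
    ("UART0",          0xFFF50000, 64 * KIB),
    ("CLINT",          0xFFF40000, 64 * KIB),
    ("GPTMR",          0xFFF10000, 64 * KIB),
    ("Bootloader ROM", 0xFFE00000, 64 * KIB),
    ("SDRAM",          0x90000000, 32 * MIB),
    ("DMEM",           0x80000000, 16 * KIB),
    ("IMEM",           0x00000000, 32 * KIB),
]


def find_region(addr):
    """Return the name of the region that contains addr, or None."""
    for name, base, size in ADDRESS_MAP:
        if base <= addr < base + size:
            return name
    return None


def regions_overlap():
    """Return True if any two regions in ADDRESS_MAP overlap."""
    spans = sorted((base, base + size) for _, base, size in ADDRESS_MAP)
    return any(prev_end > next_base
               for (_, prev_end), (next_base, _) in zip(spans, spans[1:]))


if __name__ == "__main__":
    for addr in (0x00000000, 0x80000000, 0x90000000, 0xFFF50000):
        print(f"0x{addr:08X} -> {find_region(addr)}")
    print("overlap detected:", regions_overlap())
```

A lookup like this is handy when a linker script or a debugger memory view shows an address and it is not obvious which memory it belongs to.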
J-Flash SPI
Unfortunately the J-Link EDU lacks the license for J-Flash SPI.
If you have access to a J-Link with the right license, there is a project in the download
area (de0n-spi) that connects the SPI Flash of the "JTAG Terasic Adapter" directly
to the JTAG.
TCM / SDRAM
The examples can be executed from TCM or from SDRAM. Debugging may only work from
SDRAM, because the example could be too large for the TCM.
Instruction Cache (ICACHE)
The Instruction Cache can be set with the following 4 parameters:
- ICACHE_EN
- ICACHE_NUM_BLOCKS
- CACHE_BLOCK_SIZE
- CACHE_BURSTS_EN (must be set to false)
At the beginning I had no idea what the optimal values for these were. That’s
why I tried to determine the settings experimentally. For this I used the following
3 benchmarks:
- Dhrystone
- CoreMark
- Crypto
The optimal settings of course depend on the application you are running,
so the settings for A can be different from those for B or C. I only changed the
ICACHE_NUM_BLOCKS parameter here.
The idea was to first run the benchmark in the TCM memory without ICACHE in order
to determine the optimum performance value. Then the benchmark was run in SDRAM,
without ICACHE and with ICACHE. Attempts were made to determine the optimal settings
for the ICACHE which show the best performance.
I started with the default setting of 4x32 (ICACHE_NUM_BLOCKS x CACHE_BLOCK_SIZE) and then changed only ICACHE_NUM_BLOCKS.
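The total cache size for each configuration follows directly from the two numbers. A minimal sketch, assuming that the block size is given in bytes (which matches the 8K total stated for 256x32); the function name `icache_size_bytes` is only illustrative:

```python
# Total ICACHE size for the tested "NUM_BLOCKS x BLOCK_SIZE" configurations.
# Assumption: the block size is given in bytes.
BLOCK_SIZE = 32
TESTED_NUM_BLOCKS = (4, 8, 16, 32, 64, 128, 256, 512)


def icache_size_bytes(num_blocks, block_size=BLOCK_SIZE):
    """Return the total cache size in bytes."""
    return num_blocks * block_size


if __name__ == "__main__":
    for num_blocks in TESTED_NUM_BLOCKS:
        size = icache_size_bytes(num_blocks)
        print(f"{num_blocks:>3}x{BLOCK_SIZE}: {size:>5} bytes")
```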
The benchmarks were compiled in release mode with "Link Time Optimization"
and "Level 3 for more speed".
ICACHE and Dhrystone
Running in TCM, the benchmark reaches 0.85 DMIPS/MHz. Running in SDRAM without
ICACHE, it reaches only 0.23 DMIPS/MHz. Here is an overview of the
values with the ICACHE:
- 4x32 => 0.22
- 8x32 => 0.23
- 16x32 => 0.32
- 32x32 => 0.32
- 64x32 => 0.36
- 128x32 => 0.36
- 256x32 => 0.54
It looks like the optimal setting here is 256x32, which gives a total size of 8K for
the cache. A further increase to 512x32 does not bring any more performance.
With the ICACHE you can reach about 64% of the TCM performance (0.54 vs. 0.85 DMIPS/MHz)
when running from SDRAM.
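To put the Dhrystone numbers into perspective, they can be expressed relative to the TCM result and to the uncached SDRAM result. The following is just a sketch using the values listed above; the helper names are my own choice for this example:

```python
# Dhrystone results in DMIPS/MHz, copied from the values above.
TCM_SCORE = 0.85
SDRAM_UNCACHED_SCORE = 0.23
# Key: number of 32-byte cache blocks, value: DMIPS/MHz in SDRAM with ICACHE.
ICACHE_SCORES = {4: 0.22, 8: 0.23, 16: 0.32, 32: 0.32,
                 64: 0.36, 128: 0.36, 256: 0.54}


def percent_of_tcm(score):
    """Return the score as a percentage of the TCM result."""
    return 100.0 * score / TCM_SCORE


def speedup_over_uncached(score):
    """Return the speedup factor compared to SDRAM without ICACHE."""
    return score / SDRAM_UNCACHED_SCORE


if __name__ == "__main__":
    for num_blocks, score in ICACHE_SCORES.items():
        print(f"{num_blocks:>3}x32: {score:.2f} DMIPS/MHz, "
              f"{percent_of_tcm(score):5.1f}% of TCM, "
              f"{speedup_over_uncached(score):.2f}x uncached SDRAM")
```

Note that the smallest configuration (4x32) comes out slightly below the uncached SDRAM value, so a very small cache does not help here.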
ICACHE and CoreMark
In the case of CoreMark, the "Iterations/Sec" result has to be divided by the CPU clock
frequency in MHz, here 100 MHz. Only then do you get a value for CoreMark/MHz.
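The conversion is a single division. In the sketch below, the function name `coremark_per_mhz` is made up, and the input of 86.0 Iterations/Sec is not a measured log value: it is back-calculated from the 0.86 CoreMark/MHz TCM result purely to illustrate the formula.

```python
CPU_CLOCK_MHZ = 100.0  # CPU clock used for these tests


def coremark_per_mhz(iterations_per_sec, cpu_clock_mhz=CPU_CLOCK_MHZ):
    """Convert the CoreMark "Iterations/Sec" result to CoreMark/MHz."""
    if cpu_clock_mhz <= 0:
        raise ValueError("cpu_clock_mhz must be positive")
    return iterations_per_sec / cpu_clock_mhz


if __name__ == "__main__":
    # Illustrative input only (derived from 0.86 CoreMark/MHz at 100 MHz).
    print(f"{coremark_per_mhz(86.0):.2f} CoreMark/MHz")
```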
Running in TCM, the benchmark reaches 0.86 CoreMark/MHz. Running in SDRAM without
ICACHE, it reaches only 0.24 CoreMark/MHz. Here is an overview of the
values with the ICACHE:
- 4x32 => 0.32
- 8x32 => 0.37
- 16x32 => 0.46
- 32x32 => 0.52
- 64x32 => 0.60
- 128x32 => 0.60
- 256x32 => 0.62
It looks like the optimal setting here is 256x32, which gives a total size of 8K for
the cache. A further increase to 512x32 does not bring any noticeable increase in performance.
ICACHE and Crypto
This test consists of a hash and an ECDSA benchmark. Because of the size of the benchmark,
values are only available for SDRAM, without and with the cache.
Without the cache, the values look like this:
And with the cache, the values look like this:
With the ICACHE you can double the performance here in the SDRAM.
Download
- Quartus de0n-spi_20211229 project for the direct JTAG to SPI connection (16 KB)
- de0n-spi.sof, which simply has to be loaded into the FPGA with the programmer (687 KB)
The repository for this project can be found on GitHub at
neorv32-de0n-ref.