DE0-Nano reference
Introduction
Hardware
Address Map
J-Flash SPI
TCM / SDRAM memory
Instruction Cache (ICACHE)
ICACHE and Dhrystone
ICACHE and CoreMark
ICACHE and Crypto
Download
 

Introduction

This is not the reference but my reference: it covers the parts that I find important for my tests here. Others will certainly find other parts more important, but that is the nice thing about the NEORV32 project: everyone can put their system together as they need it.

And why the DE0-Nano and not the XYZ board? Compared to my other boards, the DE0-Nano is the cheapest board, and it is easier to use the second connector if the JTAG adapter is used on the other one.

I use Quartus Prime Lite version 20.1.1 and SEGGER Embedded Studio v8.24 as the development environment here. The NEORV32 version used is v1.12.3.3.

Hardware

A DE0-Nano board with the "JTAG Terasic Adapter" and SPI Flash was used here.

The DE0-Nano provides a lot of functionality, but only the following subset was used:

  • Cyclone IV (EP4CE22F17C6)
  • Built-in USB Blaster
  • 32 MByte SDRAM
  • 2 Push-button switches
  • 8 Green User LEDs
  • 50 MHz oscillator

The following options of the NEORV32 have been enabled:

  • Bootloader
  • On-Chip Debugger (DM)
  • Internal Instruction memory
  • Internal Data memory
  • Internal Instruction Cache (ICACHE)
  • External memory interface
  • GPIO
  • WDT
  • TRNG
  • UART0
  • Core Local Interruptor
  • General Purpose Timer

Address Map

The settings on the NEORV32 result in the following address map:

 Peripheral  Address Offset  Size (bytes)  Attribute
 DM  0xFFFF0000  64K  OCD address space
 SYSINFO  0xFFFE0000  64K  System Information Memory
 GPIO  0xFFFC0000  64K  General Purpose Input / Output
 WDT  0xFFFB0000  64K  Watchdog Timer
 TRNG  0xFFFA0000  64K  True Random Number Generator
 UART0  0xFFF50000  64K  Primary UART
 CLINT  0xFFF40000  64K  Core Local Interruptor
 GPTMR  0xFFF10000  64K  General Purpose Timer
 Bootloader ROM  0xFFE00000  64K  Bootloader address space
 SDRAM  0x90000000  32M  External SDRAM
 DMEM  0x80000000  16K  On-chip Data Memory
 IMEM  0x00000000  32K  On-chip Instruction Memory
 ICACHE  - - - - - - - - -  8K  On-chip Instruction Cache
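
As a quick way to check which region a given address falls into, the table above can be written down as data. This is only a sketch for illustration: the names `ADDRESS_MAP` and `find_region` are made up here and are not part of the NEORV32 software framework. The ICACHE is left out because it has no address of its own.

```python
# Regions copied from the address map table above:
# (name, base address, size in bytes).
KIB = 1024
MIB = 1024 * KIB

ADDRESS_MAP = [
    ("IMEM", 0x00000000, 32 * KIB),
    ("DMEM", 0x80000000, 16 * KIB),
    ("SDRAM", 0x90000000, 32 * MIB),
    ("Bootloader ROM", 0xFFE00000, 64 * KIB),
    ("GPTMR", 0xFFF10000, 64 * KIB),
    ("CLINT", 0xFFF40000, 64 * KIB),
    ("UART0", 0xFFF50000, 64 * KIB),
    ("TRNG", 0xFFFA0000, 64 * KIB),
    ("WDT", 0xFFFB0000, 64 * KIB),
    ("GPIO", 0xFFFC0000, 64 * KIB),
    ("SYSINFO", 0xFFFE0000, 64 * KIB),
    ("DM", 0xFFFF0000, 64 * KIB),
]


def find_region(address):
    """Return the name of the region containing address, or None if unmapped."""
    for name, base, size in ADDRESS_MAP:
        if base <= address < base + size:
            return name
    return None


print(find_region(0x90001000))  # SDRAM
print(find_region(0xFFF50004))  # UART0
print(find_region(0x70000000))  # None
```

For example, the 32M SDRAM window ends just below 0x92000000, so that address is already outside the map.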

J-Flash SPI

Unfortunately the J-Link EDU lacks the license for J-Flash SPI. If you have access to a J-Link with the right license, there is a project in the download area (de0n-spi) that connects the SPI Flash of the "JTAG Terasic Adapter" directly to the JTAG.

TCM / SDRAM

The examples can be executed from TCM and from SDRAM. Debugging may only work in SDRAM, because an example can be too large for the TCM memory.

Instruction Cache (ICACHE)

The Instruction Cache is configured with the following 4 parameters:

  • ICACHE_EN
  • ICACHE_NUM_BLOCKS
  • CACHE_BLOCK_SIZE
  • CACHE_BURSTS_EN (must be set to false)

At the beginning I had no idea what the optimal values for these parameters are, so I tried to determine the settings experimentally. For this I used the following 3 benchmarks:

  • Dhrystone
  • CoreMark
  • Crypto

The optimal settings of course depend on the application, so the settings for application A can differ from those for B or C. I only changed the parameter ICACHE_NUM_BLOCKS here.

The idea was to first run each benchmark in the TCM memory without ICACHE to get the best achievable performance value. Then the benchmark was run in SDRAM, first without and then with the ICACHE, looking for the ICACHE settings that give the best performance.

I started with the default setting 4x32 (4 blocks of 32 bytes each) and then changed ICACHE_NUM_BLOCKS.
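
The "blocks x block size" notation translates directly into a total cache size. The sketch below assumes the block size is given in bytes, which matches the 8K total quoted for 256x32 in the results; the function name `cache_size_bytes` is only illustrative.

```python
BLOCK_SIZE_BYTES = 32  # second number in the "blocks x block size" notation


def cache_size_bytes(num_blocks, block_size=BLOCK_SIZE_BYTES):
    """Total cache size in bytes for a given number of blocks."""
    return num_blocks * block_size


# The configurations tried in the benchmarks.
for num_blocks in (4, 8, 16, 32, 64, 128, 256):
    size = cache_size_bytes(num_blocks)
    print(f"{num_blocks:>3}x{BLOCK_SIZE_BYTES} -> {size:>4} bytes")
```

So the default 4x32 cache holds only 128 bytes, while 256x32 holds 8192 bytes (8K).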

The benchmarks were compiled in release mode with "Link Time Optimization" and "Level 3 for more speed".

ICACHE and Dhrystone

Running in TCM memory, the benchmark reaches 0.85 DMIPS/MHz. In SDRAM without ICACHE it reaches only 0.23 DMIPS/MHz. Here is an overview of the values (in DMIPS/MHz) with the ICACHE:

  • 4x32 => 0.22
  • 8x32 => 0.23
  • 16x32 => 0.32
  • 32x32 => 0.32
  • 64x32 => 0.36
  • 128x32 => 0.36
  • 256x32 => 0.54

It looks like the optimal setting here is 256x32, which gives a total cache size of 8K. A further increase to 512x32 does not bring any additional performance.

With the ICACHE, code running from SDRAM reaches a little more than half of the TCM performance here.
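
To put the Dhrystone numbers in relation to the TCM result, each value can be divided by the TCM value. This is a small sketch using only the measurements listed above; the helper name `relative_to_tcm` is made up for this example.

```python
# Dhrystone results from above, in DMIPS/MHz.
TCM = 0.85
SDRAM_NO_CACHE = 0.23
DHRYSTONE = {4: 0.22, 8: 0.23, 16: 0.32, 32: 0.32, 64: 0.36, 128: 0.36, 256: 0.54}


def relative_to_tcm(value, tcm=TCM):
    """Fraction of the TCM performance reached by a given result."""
    return value / tcm


print(f"no cache: {relative_to_tcm(SDRAM_NO_CACHE):.0%} of TCM")
for blocks, dmips in DHRYSTONE.items():
    print(f"{blocks:>3}x32: {dmips:.2f} DMIPS/MHz = {relative_to_tcm(dmips):.0%} of TCM")
```

For 256x32 this gives roughly 64% of the TCM value, compared with roughly 27% without the cache.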

ICACHE and CoreMark

In the case of CoreMark, the "Iterations/Sec" result has to be divided by the CPU clock in MHz, here 100, because the CPU is running at 100 MHz. Only then do you get a value in CoreMark/MHz.
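
The conversion is a single division, sketched below. The input of 86 iterations per second is not a logged measurement; it is simply the value that corresponds to 0.86 CoreMark/MHz at 100 MHz and is used here as an illustration. The function name is likewise only an example.

```python
CPU_CLOCK_HZ = 100_000_000  # the CPU runs at 100 MHz


def coremark_per_mhz(iterations_per_sec, clock_hz=CPU_CLOCK_HZ):
    """Convert CoreMark's "Iterations/Sec" into CoreMark/MHz."""
    clock_mhz = clock_hz / 1_000_000
    return iterations_per_sec / clock_mhz


# Illustrative input: 86 iterations/s at 100 MHz.
print(f"{coremark_per_mhz(86.0):.2f} CoreMark/MHz")
```

Passing the clock as a parameter keeps the helper usable if the design is built for a different CPU frequency.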

Running in TCM memory, the benchmark reaches 0.86 CoreMark/MHz. In SDRAM without ICACHE it reaches only 0.24 CoreMark/MHz. Here is an overview of the values (in CoreMark/MHz) with the ICACHE:

  • 4x32 => 0.32
  • 8x32 => 0.37
  • 16x32 => 0.46
  • 32x32 => 0.52
  • 64x32 => 0.60
  • 128x32 => 0.60
  • 256x32 => 0.62

It looks like the optimal setting here is 256x32, which gives a total cache size of 8K. A further increase to 512x32 does not bring any noticeable increase in performance.
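
One way to see where the returns start to diminish is to look at the relative gain of each doubling of the block count. The sketch below uses only the CoreMark values listed above; `gains_per_doubling` is a name invented for this example.

```python
# CoreMark results from above, in CoreMark/MHz: (number of blocks, result).
SDRAM_NO_CACHE = 0.24
COREMARK = [(4, 0.32), (8, 0.37), (16, 0.46), (32, 0.52),
            (64, 0.60), (128, 0.60), (256, 0.62)]


def gains_per_doubling(results):
    """Return (num_blocks, relative gain over the previous configuration)."""
    return [
        (cur_blocks, cur / prev - 1.0)
        for (_, prev), (cur_blocks, cur) in zip(results, results[1:])
    ]


for blocks, gain in gains_per_doubling(COREMARK):
    print(f"{blocks:>3}x32: {gain:+.1%} compared to the previous step")

best_blocks, best_value = COREMARK[-1]
print(f"{best_blocks}x32 is {best_value / SDRAM_NO_CACHE:.2f}x the uncached SDRAM result")
```

With these numbers, the steps up to 64x32 each gain more than 10%, while the steps to 128x32 and 256x32 gain only a few percent or nothing.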

ICACHE and Crypto

This test consists of a Hash and an ECDSA benchmark. Because the benchmark is too large for the TCM memory, there are only values for the SDRAM, without and with the cache.

Without the cache, the values look like this:

And with the cache, the values look like this:

With the ICACHE you can double the performance here in the SDRAM.

Download

Quartus de0n-spi_20211229 project for the direct JTAG to SPI connection (16 KB)

de0n-spi.sof which simply has to be loaded into the FPGA with the programmer (687 KB)

The repository for this project can be found on GitHub at neorv32-de0n-ref.