Benchmark

Synthesis Result

MODULES & LOOPS    LATENCY (NS)   INTERVAL   BRAM   DSP   FF       LUT
rmsnorm_wrapper    4.050E4        4,051      16     386   37,421   44,542

Data Flow

[Figure: data flow graph (image.png)]

Timeline Trace

[Figure: timeline trace (image.png)]

Problems

  1. memcpy copies one element at a time
  2. The functions are not connected to each other in the data flow
  3. sum_of_squares cannot be pipelined because each iteration depends on the result of the previous one (data dependency)
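
The data dependency in problem 3 can be sketched as follows. This is a minimal illustration, not the actual rmsnorm code: the vector length N, the accumulator count LANES, and both function names are hypothetical. The first function shows the usual accumulation pattern, where every addition waits for the previous one. The second shows one common workaround, interleaved partial sums, which spaces out the dependent additions. Whether the tool then reaches an interval of 1 depends on the adder latency and the tool version, so treat the pragmas as hints to verify in the synthesis report. A standard C++ compiler ignores the HLS pragmas, so the code can be checked in software first.

```cpp
#include <cstddef>

constexpr std::size_t N = 4096;   // hypothetical vector length
constexpr std::size_t LANES = 8;  // hypothetical number of partial accumulators

// Baseline pattern: each iteration reads the `sum` written by the previous
// iteration, so the next addition cannot start before the last one finishes.
float sum_of_squares_naive(const float x[N]) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        sum += x[i] * x[i];
    }
    return sum;
}

// Interleaved partial sums: iteration i updates partial[i % LANES], so two
// updates of the same accumulator are LANES iterations apart.
float sum_of_squares_interleaved(const float x[N]) {
    static_assert(N % LANES == 0, "N must be a multiple of LANES");

    float partial[LANES];
#pragma HLS ARRAY_PARTITION variable=partial complete
    for (std::size_t j = 0; j < LANES; ++j) {
        partial[j] = 0.0f;
    }

    for (std::size_t i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        partial[i % LANES] += x[i] * x[i];
    }

    // Short final reduction over the LANES partial sums.
    float sum = 0.0f;
    for (std::size_t j = 0; j < LANES; ++j) {
        sum += partial[j];
    }
    return sum;
}
```

Note that the interleaved version adds the terms in a different order, so for general floating-point input the result can differ from the baseline in the last bits.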

⇨ Removing the pragmas results in similar performance, so the current pragmas have little effect