Benchmark
- HLSTransform Code
- Testbench Code
- Resource & Performance [C Synthesis]
Synthesis Result
| MODULES & LOOPS | LATENCY (NS) | INTERVAL | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|---|
| rmsnorm_wrapper | 4.050E4 | 4,051 | 16 | 386 | 37,421 | 44,542 |
Data Flow

Timeline Trace

Problems
- memcpy copies one element at a time
  - The 512-bit memory interface can deliver 16 32-bit elements per cycle (512 / 32 = 16)
  - Burst reads are required to use this bandwidth efficiently
- Functions are not connected
  - Dataflow is needed to chain the functions and let them execute concurrently
- sum_of_squares cannot be pipelined due to a data dependency
  - The loop-carried dependency stalls the pipeline
⇨ Removing pragmas results in similar performance