Given two systems, both with a Cortex-A5 CPU: one clocked at 396 MHz without L2 cache and one clocked at 500 MHz with 512 kB L2 cache. How big is the impact of the L2 cache? Since the clock frequencies differ, simply comparing the CPU time of a given program does not answer the question. I tried to answer it using perf. perf is often used to profile software, but in this case it also proved useful for comparing two different hardware implementations.
Most CPUs nowadays have internal counters which count various events (e.g. executed instructions, cache misses, executed branches and branch misses). Other hardware, e.g. cache controllers, might expose performance counters too, but this article focuses on the hardware counters exposed by the CPU.
The Linux perf_events subsystem exports these counters to user space. Besides the hardware events, there are also various software events which are counted (e.g. context switches). The perf utility is a user space application which makes use of the perf_events interface of the Linux kernel. The building blocks for most perf commands are the available event types, which are listed by the perf list command.
The two mentioned system configurations can be found on the Freescale Vybrid based Toradex Colibri VF50 and VF61 modules. As of now, the perf list output looks rather sparse:
root@colibri-vf:~# perf list

List of pre-defined events (to be used in -e):

  alignment-faults                          [Software event]
  context-switches OR cs                    [Software event]
  cpu-clock                                 [Software event]
  cpu-migrations OR migrations              [Software event]
  dummy                                     [Software event]
  emulation-faults                          [Software event]
  major-faults                              [Software event]
  minor-faults                              [Software event]
  page-faults OR faults                     [Software event]
  task-clock                                [Software event]

  rNNN                                      [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier         [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:[/len][:access]                       [Hardware breakpoint]
The CPU counters are missing. According to the Cortex-A5 Technical Reference Manual, the CPU should support various counters. Some digging revealed two prerequisites: the config option CONFIG_HW_PERF_EVENTS (which makes sure the architecture-dependent driver arch/arm/kernel/perf_event_v7.c gets compiled into the kernel) and an appropriate device tree node. With both in place, perf list shows the CPU counters:
# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches           [Hardware event]
  branch-misses                             [Hardware event]
  cache-misses                              [Hardware event]
  cache-references                          [Hardware event]
  cpu-cycles OR cycles                      [Hardware event]
  instructions                              [Hardware event]

  alignment-faults                          [Software event]
  context-switches OR cs                    [Software event]
  cpu-clock                                 [Software event]
  cpu-migrations OR migrations              [Software event]
  dummy                                     [Software event]
  emulation-faults                          [Software event]
  major-faults                              [Software event]
  minor-faults                              [Software event]
  page-faults OR faults                     [Software event]
  task-clock                                [Software event]

  L1-dcache-load-misses                     [Hardware cache event]
  L1-dcache-loads                           [Hardware cache event]
  L1-dcache-prefetch-misses                 [Hardware cache event]
  L1-dcache-prefetches                      [Hardware cache event]
  L1-dcache-store-misses                    [Hardware cache event]
  L1-dcache-stores                          [Hardware cache event]
  L1-icache-load-misses                     [Hardware cache event]
  L1-icache-loads                           [Hardware cache event]
  L1-icache-prefetch-misses                 [Hardware cache event]
  L1-icache-prefetches                      [Hardware cache event]
  branch-load-misses                        [Hardware cache event]
  branch-loads                              [Hardware cache event]
  dTLB-load-misses                          [Hardware cache event]
  dTLB-store-misses                         [Hardware cache event]
  iTLB-load-misses                          [Hardware cache event]

  rNNN                                      [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier         [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:[/len][:access]                       [Hardware breakpoint]
Unfortunately, the listed events by themselves leave some questions unanswered: What exactly does cache-misses mean? Global cache misses (accesses which miss in both caches), or only the misses of a single cache level? The source file mentioned above, combined with the information from the ARM Cortex-A5 TRM, gives somewhat more insight. This list shows the mapping between the above perf events and the hardware counters they actually represent:
|perf Event||PMU/Reg||ARM Cortex-A5 TRM Description|
|branches||0x0c||Software change of the PC (according to source all taken branches)|
|branch-misses||0x10||Mispredicted or not predicted branch speculatively executed|
|cache-misses||0x03||Level 1 data cache refill|
|cache-references||0x04||Level 1 data cache access|
|cycles||PMCCNTR||Counts processor clock cycles|
|instructions||0x08||Instruction architecturally executed|
|L1-dcache-load-misses||0x03||Level 1 data cache refill|
|L1-dcache-loads||0x04||Level 1 data cache access|
|L1-dcache-prefetch-misses||0xc3||Prefetch linefill dropped|
|L1-dcache-prefetches||0xc2||Linefill because of prefetch|
|L1-dcache-store-misses||0x03||Level 1 data cache refill|
|L1-dcache-stores||0x04||Level 1 data cache access|
|L1-icache-load-misses||0x01||Level 1 instruction cache refill|
|L1-icache-loads||0x14||Level 1 instruction cache access|
|L1-icache-prefetch-misses||0xc3||Prefetch linefill dropped|
|L1-icache-prefetches||0xc2||Linefill because of prefetch|
|branch-load-misses||0x10||Mispredicted or not predicted branch speculatively executed|
|branch-loads||0x12||Predictable branch speculatively executed|
|dTLB-load-misses||0x05||Level 1 data TLB refill|
|dTLB-store-misses||0x05||Level 1 data TLB refill|
|iTLB-load-misses||0x02||Level 1 instruction TLB refill|
Since the perf_events interface is architecture independent, not all events of the ARM PMU map perfectly to perf events. Several counters appear twice, and some are misleading (e.g. L1-dcache-prefetches actually counts both D$ and I$ prefetches). The PMUs of other ARMv7 CPUs expose a similar number of events, often mapped as shown above, but some events might differ, hence YMMV! Another interesting fact is that the Cortex-A5 provides only two hardware event counters. The CPU cycle counter is a separate register and hence comes “for free”.
OK, now let's compare an application on the two systems mentioned above. I chose four events: task-clock, cycles, instructions and branch-misses. This selection makes sure that we have a dedicated PMU hardware counter for each of the hardware-counted events instructions and branch-misses (again, cycles does not need a PMU hardware counter since it is exposed as a separate register, and task-clock is a software event). My test application uses the cairo framework to draw 1000 rectangles on a framebuffer device. One can argue whether that load is representative, but I do not want to go down that road!
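The counter budget described above can be sketched in a few lines of Python. This is only an illustration: `batch_events`, `PMU_COUNTERS` and `FREE_EVENTS` are names made up for this example, and the set of "free" events is an assumption based on the text (cycles has its own register, software events need no PMU counter). It splits a wish list of events into separate perf stat runs that each need at most two PMU counters.

```python
# Assumption (from the text above): the Cortex-A5 PMU has two programmable
# event counters; cycles uses its own register, software events use none.
PMU_COUNTERS = 2
FREE_EVENTS = {"cycles", "cpu-cycles", "task-clock", "cpu-clock",
               "context-switches", "page-faults"}


def batch_events(events, counters=PMU_COUNTERS):
    """Split events into runs that each need at most `counters` PMU counters."""
    free = [e for e in events if e in FREE_EVENTS]
    counted = [e for e in events if e not in FREE_EVENTS]
    batches = []
    for i in range(0, len(counted), counters):
        # Free events are repeated in every run; they cost no PMU counter.
        batches.append(free + counted[i:i + counters])
    return batches or [free]


wanted = ["task-clock", "cycles", "instructions", "branch-misses",
          "cache-references", "cache-misses"]
for batch in batch_events(wanted):
    print("perf stat -e " + ",".join(batch) + " ./cairo")
```

Splitting the events over several runs only gives comparable numbers if the workload behaves the same on every run.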
The first system under test is the Colibri VF50 with the 396MHz clocked CPU and without L2 cache. Here is a sample output:
# perf stat -e task-clock,cycles,instructions,branch-misses ./cairo

 Performance counter stats for './cairo':

       4346.477415      task-clock (msec)         #    0.138 CPUs utilized
        1705864718      cycles                    #    0.392 GHz
         292200141      instructions              #    0.17  insns per cycle
            920911      branch-misses             #    0.212 M/sec
And the same application on the Colibri VF61 with the 500MHz clocked CPU and 512kB L2 cache:
# perf stat -e task-clock,cycles,instructions,branch-misses ./cairo

 Performance counter stats for './cairo':

       2851.861460      task-clock (msec)         #    0.124 CPUs utilized
        1418321454      cycles                    #    0.497 GHz
         290460474      instructions              #    0.20  insns per cycle
            860675      branch-misses             #    0.302 M/sec
I made a “warm-up” run and then 5 measurements on each system. perf calculates the CPU frequency from the (software based) task clock and the cycle count (shown behind the # sign of the cycles row). The values match the actual CPU frequencies pretty closely, which suggests that the hardware counts and the software task-clock are accurate. All counters were stable across the 5 measurements, with standard deviations below 1% of the mean value.
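The frequency column can be reproduced by hand from the sample outputs above. A minimal Python sketch (the function name `effective_ghz` is made up for this example; the numbers are copied from the two sample runs):

```python
def effective_ghz(cycles, task_clock_msec):
    """Average clock rate while the task was running, in GHz."""
    seconds = task_clock_msec / 1000.0
    return cycles / seconds / 1e9


# Values copied from the two perf stat sample outputs above.
vf50_ghz = effective_ghz(1705864718, 4346.477415)
vf61_ghz = effective_ghz(1418321454, 2851.861460)

print(f"VF50: {vf50_ghz:.3f} GHz")
print(f"VF61: {vf61_ghz:.3f} GHz")
```

Rounded to three digits, this gives the same 0.392 GHz and 0.497 GHz that perf prints.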
Since the two systems use the very same CPU core, the difference in clock cycles required to execute the program can most likely be attributed to the L2 cache. Another measurement showed that the L1 cache misses are of the same order of magnitude on the two systems. Unfortunately it is not possible to measure the cache miss penalties, which are likely quite different between the two systems.
Assuming we can attribute the cycle count difference to the L2 cache, we can answer the initial question: The speedup between VF50 and VF61 attributed to the L2 cache itself is 1.20 (calculated using clocks per instruction). The overall speedup, which includes the higher CPU frequency, is 1.53 (calculated using the execution times).
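The arithmetic behind these two numbers can be sketched as follows. This uses the single sample runs shown above rather than the averaged measurements, and the helper names `cpi` and `ghz` are made up for this example. It also checks the decomposition time = instructions × CPI / frequency, which splits the overall speedup into an instruction-count factor, a CPI factor and a frequency factor.

```python
import math

# Sample values copied from the two perf stat runs above.
vf50 = {"msec": 4346.477415, "cycles": 1705864718, "instructions": 292200141}
vf61 = {"msec": 2851.861460, "cycles": 1418321454, "instructions": 290460474}


def cpi(run):
    """Clock cycles per instruction."""
    return run["cycles"] / run["instructions"]


def ghz(run):
    """Average clock rate derived from cycles and task-clock."""
    return run["cycles"] / (run["msec"] * 1e-3) / 1e9


cpi_speedup = cpi(vf50) / cpi(vf61)        # attributed to the L2 cache
freq_speedup = ghz(vf61) / ghz(vf50)       # attributed to the faster clock
insn_ratio = vf50["instructions"] / vf61["instructions"]
total_speedup = vf50["msec"] / vf61["msec"]

# time = instructions * CPI / frequency, so the factors multiply to the total.
assert math.isclose(insn_ratio * cpi_speedup * freq_speedup, total_speedup)

print(f"CPI VF50: {cpi(vf50):.2f}, CPI VF61: {cpi(vf61):.2f}")
print(f"speedup from CPI:       {cpi_speedup:.2f}")
print(f"speedup from frequency: {freq_speedup:.2f}")
print(f"overall speedup:        {total_speedup:.2f}")
```

For these two sample runs the CPI-based speedup comes out at roughly 1.20 and the overall speedup at roughly 1.52. The latter is slightly below the 1.53 quoted above, which presumably stems from the averaged measurements.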
Note: When using more hardware events than there are hardware counters available, the subsystem uses the counters in turns and extrapolates the values. This can lead to values which are way off, especially if the sample rate is low (exposed through /proc/sys/kernel/perf_event_max_sample_rate). The subsystem lowers the sample rate automatically on slow systems, in my case after using the perf top command. With a sample frequency of 200 and more than two event counters in use, I got some interesting values:
...
            878094      branch-misses             #  191.03% of all branches          (99.60%)
            459661      branches                                                      (78.82%)
...
...
          23028913      cycles                    #    0.005 GHz                      (82.95%)
         293729707      instructions              #   12.75  insns per cycle          (99.77%)
...
More branch misses than branches! And 12.75 instructions per cycle on a single-issue in-order Cortex-A5, impressive 😉
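How such impossible values can arise is easy to reproduce with a toy model. As far as I understand it, a multiplexed count is scaled by the ratio of the time the event was enabled to the time it was actually running on a counter (the percentage in parentheses above). The Python sketch below uses entirely hypothetical numbers and a made-up schedule. It only illustrates that the scaling assumes an evenly distributed event rate and goes wrong when the counter happens to run during unrepresentative time slices.

```python
def scaled_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed counter to the full measurement window."""
    if time_running == 0:
        return 0.0
    return raw_count * time_enabled / time_running


# Hypothetical workload: events per time slice, very unevenly distributed.
branches_per_slice = [100, 900, 100, 900]
branch_misses_per_slice = [50, 400, 50, 400]

# Hypothetical schedule: the branches counter only runs during slices 0 and 2,
# while branch-misses is counted during every slice.
running = [True, False, True, False]

raw = sum(b for b, on in zip(branches_per_slice, running) if on)
estimate = scaled_count(raw, time_enabled=len(running),
                        time_running=sum(running))

print("true branches:     ", sum(branches_per_slice))
print("estimated branches:", estimate)
print("branch misses:     ", sum(branch_misses_per_slice))
```

In this toy model the estimated number of branches ends up below the number of branch misses, although every slice has fewer misses than branches.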