One way to gauge how a given piece of code, or an entire application, is performing is to look at the average number of cycles needed to retire an instruction. This figure indicates how much latency is in the system and is a valuable measure of application performance.
Cycles per instruction (CPI) is a ratio of two values: the numerator is the number of CPU cycles used, and the denominator is the number of instructions executed. Because it is a ratio, comparing one version of a piece of code with another requires keeping one of the two values constant in order to tell whether an optimization is working. If more CPU cycles are being used but more instructions are also being executed, the ratio could stay the same and the measure will show no improvement. The goal is to lower the CPI in specific parts of the code as well as across the overall application.
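As a rough illustration of the ratio and of why one value should be held constant when comparing versions, here is a minimal Python sketch. The counter readings are made-up numbers, and the `cpi` helper and variable names are hypothetical; in practice the counts would come from a hardware performance counter tool.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction: CPU cycles used divided by instructions executed."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


# Hypothetical counter readings for three runs of the same code region.
baseline = {"cycles": 1200, "instructions": 600}
# Same instruction count, fewer cycles: the CPI drops, so the change helped.
tuned = {"cycles": 900, "instructions": 600}
# More cycles AND more instructions: the CPI is unchanged, so the ratio
# alone does not reveal whether anything improved.
inflated = {"cycles": 2400, "instructions": 1200}

baseline_cpi = cpi(**baseline)
tuned_cpi = cpi(**tuned)
inflated_cpi = cpi(**inflated)

print(f"baseline: {baseline_cpi:.2f}")
print(f"tuned:    {tuned_cpi:.2f}")
print(f"inflated: {inflated_cpi:.2f}")
```

Holding the instruction count fixed, as in the baseline and tuned runs, is what makes the two CPI values directly comparable.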
Computing the CPI for a thread is straightforward: divide the number of cycles the thread consumes by the number of instructions it retires. For the per-core case, all of the threads running on a hardware core must be aggregated to arrive at the proper ratio. For example, if a certain part of the code takes 1200 cycles and executes 600 instructions, then the CPI is 1200/600 = 2. If the core in this case should be able to reach a CPI of 0.5, then not enough work is being sent to the core, as only ¼ of its capacity is being used. If more threads can be added to the core's workload, the per-core CPI will decrease toward the lowest value the core can sustain. CPI is therefore a valuable metric for understanding what is happening at the thread and core level.
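The per-core arithmetic can be sketched as follows. This is only one plausible way to aggregate, under two stated assumptions: the threads on a core run concurrently over the same window of core cycles, and adding a thread does not slow the others down. The function names `core_cpi` and `utilization` are hypothetical, and the best-case CPI of 0.5 is taken from the example in the text.

```python
from typing import Sequence


def core_cpi(core_cycles: int, instructions_per_thread: Sequence[int]) -> float:
    """Per-core CPI: cycles elapsed on the core divided by the instructions
    retired by all threads on that core during the same window (assumption)."""
    total_instructions = sum(instructions_per_thread)
    if total_instructions <= 0:
        raise ValueError("at least one instruction must be retired")
    return core_cycles / total_instructions


def utilization(measured_cpi: float, best_cpi: float) -> float:
    """Fraction of the core's capacity in use, given the best CPI it can reach."""
    return best_cpi / measured_cpi


BEST_CPI = 0.5  # best-case per-core CPI from the example in the text

# One thread: 1200 cycles, 600 instructions.
one_thread = core_cpi(1200, [600])
# Two threads on the same core, each retiring 600 instructions in the
# same 1200 cycles (idealized: no slowdown from sharing the core).
two_threads = core_cpi(1200, [600, 600])

print(one_thread, utilization(one_thread, BEST_CPI))    # 2.0 0.25
print(two_threads, utilization(two_threads, BEST_CPI))  # 1.0 0.5
```

With one thread, the sketch reproduces the figures above: a CPI of 2 against a best case of 0.5 means a quarter of the core's capacity is in use.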
By understanding this metric, guidelines can be recommended that help the developer maximize the work per hardware element. On an Intel Xeon Phi processor, there are 72 cores that can each run 4 threads simultaneously. If the CPI is not low enough, more work should be added to the core. As a general rule for the Intel Xeon Phi processor, if there are 36 or fewer threads, put each thread on its own tile. If there are between 37 and 72 threads, use one thread per core. And if there are between 73 and 144 threads, assign 2 threads per core. Of course, the best assignment for reducing the CPI will depend on the work that each thread must perform.
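The rule of thumb above can be written as a small lookup. This is a sketch only: the function name `placement_for` and the returned strings are hypothetical, the constants assume the 72-core, 36-tile part described in the text, and thread counts above 144 return `None` because the rule does not cover them.

```python
from typing import Optional

CORES = 72             # cores on the processor, as stated in the text
TILES = 36             # implied by the "36 or fewer threads, one per tile" rule
THREADS_PER_CORE = 4   # hardware threads per core, as stated in the text


def placement_for(num_threads: int) -> Optional[str]:
    """Map a thread count to the placement suggested by the general rule."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    if num_threads > CORES * THREADS_PER_CORE:
        raise ValueError("more threads than hardware threads available")
    if num_threads <= TILES:
        return "one thread per tile"
    if num_threads <= CORES:
        return "one thread per core"
    if num_threads <= 2 * CORES:
        return "two threads per core"
    return None  # beyond 144 threads the rule gives no guidance


for n in (16, 36, 37, 72, 73, 144, 200):
    print(n, placement_for(n))
```

The result is only a starting point. Whether a given placement actually lowers the CPI still depends on the work each thread performs, so it is worth measuring after each change.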
While there are many optimization techniques available to HPC applications, it is always useful to understand what is happening down at the core and thread level and to attempt to maximize the workload per thread or core. New knowledge about the application can be gained by looking at how hard the threads and cores are working and then assigning the work accordingly.