The HPM was developed for performance
measurement of applications running on IBM systems supporting the following
processors and operating systems: Power3
and Power4 with AIX 5L and AIX 4.3.3.
The HPM consists of:
- A utility (hpmcount), which starts an application and, at the end of execution, provides wall clock time, hardware performance counter information, derived hardware metrics, and resource utilization statistics.
- An instrumentation library (libhpm), which provides instrumented programs with a summary output containing the above information for each instrumented region in a program (resource utilization statistics are provided only once, for the whole section of the program that is instrumented). This library supports serial and parallel (MPI, threaded, and mixed mode) applications, written in Fortran, C, and C++.
- A graphical user interface (PeekPerf), for graphical visualization of the performance file generated by libhpm.
- A utility (hpmstat), to collect system-level hardware performance counter information.
For more information on the resource utilization
statistics, please refer to the “getrusage”
man pages.
Requirements:
- On AIX 5L: the bos.pmapi fileset, which is a product-level fileset provided with the AIX distribution, but not installed by default.
- On AIX 4.3.3: the PMAPI kernel extension, available on http://www.alphaworks.ibm.com/tech/pmapi
Usage:
Sequential and shared memory programs:
> hpmcount [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>
MPI programs:
> poe hpmcount [-a] [-o <filename>] [-n] [-s <set>] [-g <group>] [-e ev[,ev]*] <program>
or:
> hpmcount [-h][-c][-l]
where:
<program> is the program to be executed.
-h displays a help message.
-a aggregates counts on POE applications. With this flag, a single performance file is generated for all tasks. This flag only works with POE and/or LoadLeveler, and it requires the availability of a parallel file system (e.g., GPFS) on the system. Notice: the program may hang if a parallel file system is not available and this flag is set.
-c lists events from all counters.
-l lists all groups (on POWER 4 systems).
-o <filename> generates an output file: <filename>.<pid>. On parallel programs, this flag creates one file for each process (unless the “-a” flag is set). By default, the output goes to stdout.
-n for "no output to stdout". This flag is only active when the -o flag is used.
-k includes counts of system activity on behalf of the process.
-e ev0,ev1,... (POWER 3) list of event numbers, separated by commas. ev<i> corresponds to the event selected for counter <i>. The event number can be obtained by running hpmcount -c (on Power3 systems).
-g <group> (POWER 4 only). Valid groups are from 0 to 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. The default group is 60. Groups considered interesting for application performance analysis are:
- 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
- 56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
- 5, for counts of loads from L2, L3, and memory.
- 58, for counts of cycles, instructions, loads from L3, and loads from memory.
- 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
-s predefined set of events. On Power 4 systems, -s is the same as -g. On Power 3 systems, the available sets are:
| Event set 1 (def.) | Event set 2 | Event set 3 | Event set 4 | Event set 5 | Event set 6 |
| Cycles | Cycles | Cycles | Cycles | Cycles | Cycles |
| Inst. completed | Inst. completed | Loads dispatched | Inst. completed | Inst. completed | Inst. dispatched |
| TLB misses | TLB misses | L1 load misses | Cycles w/ 0 inst. comp. | I cache misses | Inst. completed |
| Stores completed | Stores dispatched | L2 misses | Stores completed | Branches predicted | I cache misses |
| Loads completed | L1 store misses | Stores dispatched | Loads completed | Branches completed | I cache hits |
| FPU0 ops | Loads dispatched | L2 store misses | FXU0 ops | Cond. branches | Cond. branches |
| FPU1 ops | L1 load misses | Number of write back | FXU1 ops | Br. Misspred. | Br. dispatched |
| FMAs executed | LSU idle | LSU idle | FXU2 ops | TLB misses | TLB misses |
Notice that unless the flag “-a” is provided, or the environment variable HPM_AGGREGATE_COUNTERS is set, a parallel program will generate one output for each task. Thus, if the “-o” flag is not used, on AIX systems it is recommended that the environment variable MP_LABELIO be set to YES, in order to correlate each line of the output with the corresponding task. Another option is to set the environment variable MP_STDOUTMODE to one of the task IDs (e.g., 0), to discard output from the other tasks. In this latter case, only the output from the selected task will appear in stdout.
Also notice that on AIX systems, sequential programs compiled with the “mp” prefix (e.g., mpxlf) are MPI programs and need to be executed as “poe hpmcount … program”.
Libhpm supports multiple instrumentation sections and nested instrumentation, and each instrumented section can be called multiple times. When nested instrumentation is used, exclusive duration is generated for the outer sections. Average and standard deviation are provided when an instrumented section is activated multiple times.
Libhpm supports OpenMP and threaded applications. In this case, the thread-safe version of the library (libhpm_r) should be used. Both 32- and 64-bit applications are supported, as long as all modules are compiled in one of the two modes (32- or 64-bit).
Notice that libhpm collects information and
performs summarization during run-time. Thus, there could be a considerable
overhead if instrumentation sections are inserted inside inner loops.
Libhpm uses the same set of hardware counter events used by hpmcount. The event set to be used can be selected via the environment variable: HPM_EVENT_SET.
On Power 4 systems, HPM_EVENT_SET should be set to a group from 0 to 60. The default is group 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. Groups considered interesting for application performance analysis are:
- 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
- 56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
- 5, for counts of loads from L2, L3, and memory.
- 58, for counts of cycles, instructions, loads from L3, and loads from memory.
- 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
On Power 3 systems, HPM_EVENT_SET can be set to a value between 1 and 4. The default is 1. The event sets on the Power3 are:
| Event set 1 (def.) | Event set 2 | Event set 3 | Event set 4 | Event set 5 | Event set 6 |
| Cycles | Cycles | Cycles | Cycles | Cycles | Cycles |
| Inst. completed | Inst. completed | Loads dispatched | Inst. completed | Inst. completed | Inst. dispatched |
| TLB misses | TLB misses | L1 load misses | Cycles w/ 0 inst. comp. | I cache misses | Inst. completed |
| Stores completed | Stores dispatched | L2 misses | Stores completed | Branches predicted | I cache misses |
| Loads completed | L1 store misses | Stores dispatched | Loads completed | Branches completed | I cache hits |
| FPU0 ops | Loads dispatched | L2 store misses | FXU0 ops | Cond. branches | Cond. branches |
| FPU1 ops | L1 load misses | Number of write back | FXU1 ops | Br. Misspred. | Br. dispatched |
| FMAs executed | LSU idle | LSU idle | FXU2 ops | TLB misses | TLB misses |
The following instrumentation functions are provided:
hpmInit( taskID, progName )
f_hpminit( taskID, progName )
- taskID is an integer value indicating the node ID.
- progName is a string with the program name.
hpmStart( instID, label )
f_hpmstart( instID, label )
- instID is the instrumented section ID. It should be > 0 and <= 100. To run a program with more than 100 instrumented sections, the user should set the environment variable HPM_NUM_INST_PTS. In this case, instID should be less than the value set for HPM_NUM_INST_PTS.
- label is a string containing a label, which is displayed by PeekPerf.
hpmStop( instID )
f_hpmstop( instID )
- For each call to hpmStart, there should be a corresponding call to hpmStop with a matching instID. This requirement must hold during program execution.
hpmTstart( instID, label )
f_hpmtstart( instID, label )
hpmTstop( instID )
f_hpmtstop( instID )
- In order to instrument threaded applications, one should use the pair hpmTstart and hpmTstop to start and stop the counters independently on each thread. Notice that if two distinct threads use the same instID, the output will indicate multiple calls. However, the counts will be accumulated. See the section on multi-threaded issues for examples.
hpmGetTimeAndCounters( numCounters, time, values )
f_GetTimeAndCounters ( numCounters, time, values )
- Every time this function is called, it returns the time in seconds and the accumulated counts since the call to hpmInit.
- numCounters is an integer indicating the number of counters to be accessed.
- time is a double precision float.
- values is a “long long” vector of size “numCounters”.
hpmGetCounters( values )
f_hpmGetCounters ( values )
- Similar to hpmGetTimeAndCounters, but only returns the total counts since the call to hpmInit and, to minimize intrusion and overhead, does not perform any check on the vector size.
hpmTerminate( taskID )
f_hpmterminate( taskID )
- This function will generate the output. If the program exits without calling hpmTerminate, no performance information will be generated.
A summary report for each task will be written by default in the file: perfhpm<taskID>.<pid>. Additionally, a set of performance files named hpm<taskID>_<progName>_<pid>.viz will be generated to be used as input for PeekPerf. The generation of the “.viz” file can be avoided with the environment flag: HPM_VIZ_OUTPUT = FALSE.
Users can define the output file name with the environment flag: HPM_OUTPUT_NAME. libhpm will still add the extensions _<taskID>.hpm and _<taskID>.viz for the performance files and visualization files, respectively. Using this environment flag, one can, for example, set up the output file name to include the date and time. For example, using ksh:
MYDATE=$(date +"%Y%m%d:%H%M%S")
export HPM_OUTPUT_NAME=myprogram_$MYDATE
In this example, the output file for task 27 will
have the name: myprogram_yyyymmdd:HHMMSS_0027.hpm
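The naming rule above can be sketched in C as follows. This is only an illustration under stated assumptions: hpm_output_file_name is a hypothetical helper, not a libhpm function, and the four-digit, zero-padded task ID is inferred from the “_0027” example rather than from a documented format.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of libhpm): builds the performance file
 * name for one task from the value of HPM_OUTPUT_NAME. The four-digit,
 * zero-padded task ID is an assumption based on the "_0027" example. */
int hpm_output_file_name(char *buf, size_t size, const char *base, int task_id)
{
    assert(buf != NULL && base != NULL && task_id >= 0);
    /* snprintf returns the length the full name needs, excluding the NUL. */
    int needed = snprintf(buf, size, "%s_%04d.hpm", base, task_id);
    /* Report truncation (or an encoding error) as -1. */
    return (needed >= 0 && (size_t)needed < size) ? needed : -1;
}
```

Called with the base name "run" and task 27, the helper writes "run_0027.hpm" into the buffer and returns its length; a buffer that is too small yields -1 instead of a truncated name.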
Any software instrumentation is expected to incur some overhead. Since it is not possible to eliminate the overhead, the goal was to minimize it. In HPM, most of the overhead is due to time measurement, which unfortunately tends to be an expensive operation on most systems. A second source of overhead is run-time accumulation and storage of performance data. Notice that libhpm collects information and performs summarization during run-time. Hence, there could be considerable overhead if instrumentation sections are inserted inside inner loops.
Several issues were considered in order to reduce measurement error. First, most of the library operations are executed before starting the counters, when returning control to the program, or after stopping the counters, when the program calls a “stop” function. However, even at the library level, there are a few operations that must be executed within the counting process, for example, releasing a lock. Second, since timing collection and capture of hardware counter information are two distinct operations, the order of these operations had to be set. Basically, it had to be decided between timing the counters and counting the timer. Since timing is about one order of magnitude more expensive than counting, the timer call precedes the PMAPI call to start the counters in the HPM “start” function, while the first two operations executed by the HPM “stop” function are stopping the counters, followed by calling the timer function. Thus, there is a small error in the time measurement, but there is minimal error in the counting process. Finally, to access and read the counters, the library calls lower-level routines from the operating system. Hence, there are always some instructions executed by the kernel that are accounted as part of the program. To compensate for this measurement error, HPM uses the hardware counters during the initialization and finalization of the library to estimate the cost of one call to the start and stop functions. This estimated overhead is subtracted from the values obtained on each instrumented code section. With this approach, the measurement error becomes close to zero. However, since this is a statistical approximation, in some situations this approach fails. In this case, the following message is printed on stderr:
“WARNING: Measurement error for <event name> not removed”,
which indicates that the estimated overhead was not subtracted from the measured values. One can deactivate the procedure that attempts to remove measurement errors by setting the environment variable HPM_WITH_MEASUREMENT_ERROR to TRUE (1).
declaration:
#include "libhpm.h"
use:
hpmInit( taskID, "my program" );
hpmStart( 1, "outer call" );
do_work();
hpmStart( 2, "computing meaning of life" );
do_more_work();
hpmStop( 2 );
hpmStop( 1 );
hpmTerminate( taskID );
The syntax for C and C++ is the same. However, the
include files are different, since the libhpm routines must be declared as
having extern "C" linkage in C++.
Fortran programs should call the functions with the prefix “f_”. Also, notice that the following declaration is required in all source files that have instrumentation calls.
declaration:
#include "f_hpm.h"
use:
call f_hpminit( taskID, "my program" )
call f_hpmstart( 1, "Do Loop" )
do …
  call do_work()
  call f_hpmstart( 5, "computing meaning of life" )
  call do_more_work()
  call f_hpmstop( 5 )
end do
call f_hpmstop( 1 )
call f_hpmterminate( taskID )
When placing instrumentation inside of parallel regions, one should use different ID numbers for each thread, as shown in the following Fortran example:
!$OMP PARALLEL
!$OMP&PRIVATE (instID)
instID = 30+omp_get_thread_num()
call f_hpmtstart( instID, "computing meaning of life" )
!$OMP DO
do ...
  call do_work()
end do
call f_hpmtstop( instID )
!$OMP END PARALLEL
Notice that the functions hpmTstart and hpmTstop are required for threaded programs. Also, the parameter instID should always be a variable or a number; it cannot be an expression. This is because the include file contains a set of "define" statements that are used during the pre-processing phase to collect line numbers and file names. Finally, notice that the library accepts the use of the same instID for different threads. However, the counters will be accumulated for all instances with the same instID.
In order to use libhpm, one should add libpmapi.a, libhpm.a (or libhpm_r.a), and libm to the link step:
#
HPM_DIR = <<<ENTER HPM Home directory>>>
HPM_INC = -I$(HPM_DIR)/include
HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
FFLAGS = -qsuffix=cpp=f <<<Other Flags>>>
my.x : my.f
	$(FF) $(HPM_INC) $(FFLAGS) my.f $(HPM_LIB) -o my.x
The flag “-qsuffix=cpp=f” is only required for the compilation of Fortran programs with extension “.f”, on AIX systems.
The following environment flags can be used on libhpm and hpmcount:
- HPM_EVENT_SET
  - Used to select one of the event sets on Power 3 systems, or to select a group of events on Power 4 systems.
  - On Power 3 systems, integer between 1 and 4.
  - On Power 4 systems, integer between 0 and 60.
- HPM_DIV_WEIGHT
  - Provides a weight to be used to compute “weighted flips” on Power 4 systems.
  - On Power 4 systems, integer > 1.
In addition, users can provide estimates of memory, cache, and TLB miss latencies for the computation of derived metrics, with the following environment variables (please notice that not all flags are valid on all systems):
- HPM_MEM_LATENCY: estimated latency for a memory load.
- HPM_AVG_L3_LATENCY: estimated average latency for an L3 load.
- HPM_L3_LATENCY: estimated latency for an L3 load within a MCM.
- HPM_L35_LATENCY: estimated latency for an L3 load outside of the MCM.
- HPM_AVG_L2_LATENCY: estimated average latency for an L2 load.
- HPM_L2_LATENCY: estimated latency for an L2 load from the processor.
- HPM_L25_LATENCY: estimated latency for an L2 load from the same MCM.
- HPM_L275_LATENCY: estimated latency for an L2 load from another MCM.
- HPM_TLB_LATENCY: estimated latency for a TLB miss.
When computing derived metrics that take into consideration estimated latencies for L2 or L3, HPM will use the provided “average latency” only if the other latencies for the same cache level are not provided. For example, it will only use the value set in HPM_AVG_L3_LATENCY if at least one of the values of HPM_L3_LATENCY and HPM_L35_LATENCY is not set.
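The selection rule for the L3 case can be sketched in C as shown here. The function and parameter names are hypothetical and not libhpm identifiers, and the convention that a negative value means "not set" is an assumption made for the example. The result is the estimated latency in cycles, before any division by the processor frequency.

```c
#include <assert.h>

/* Hypothetical sketch of the L3 rule; a negative latency stands for
 * "not set". Names are illustrative, not libhpm identifiers. */
double l3_latency_cycles(double l3_lat, double l35_lat, double avg_l3_lat,
                         double l3_loads, double l35_loads)
{
    assert(l3_loads >= 0.0 && l35_loads >= 0.0);
    if (l3_lat >= 0.0 && l35_lat >= 0.0) {
        /* Both specific latencies are set: the average is ignored. */
        return l3_lat * l3_loads + l35_lat * l35_loads;
    }
    if (avg_l3_lat >= 0.0) {
        /* At least one specific latency is missing: use the average. */
        return avg_l3_lat * (l3_loads + l35_loads);
    }
    return -1.0; /* no usable latency was provided */
}
```

With L3 and L3.5 latencies of 102 and 150 cycles, 10 L3 loads and 2 L3.5 loads give 1320 cycles; if the L3.5 latency is missing and the average is 120 cycles, the same loads give 1440 cycles.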
Users can also provide the estimated memory latencies, as well as the event set and/or a divide weight, with the file HPM_flags.env. Each line in the file specifies one value, which takes precedence over the corresponding environment variable. The syntax is: <flag name> <value>.
HPM_flags.env example:
HPM_MEM_LATENCY 400
HPM_L3_LATENCY 102
HPM_L35_LATENCY 150
HPM_L2_LATENCY 12
HPM_L25_LATENCY 72
HPM_L275_LATENCY 108
HPM_TLB_LATENCY 700
HPM_EVENT_SET 5
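A minimal C sketch of reading one line in the “<flag name> <value>” syntax is given here. It is not how libhpm parses the file: parse_hpm_flag_line is a hypothetical helper, and it assumes that values are integers, as they are in the example file.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical parser for one "<flag name> <value>" line; not libhpm code.
 * Assumes integer values, as in the example file. Returns 1 when both the
 * name and the value were read, and 0 otherwise. */
int parse_hpm_flag_line(const char *line, char name[64], long *value)
{
    assert(line != NULL && name != NULL && value != NULL);
    /* %63s bounds the flag name; %ld reads the value after whitespace. */
    return sscanf(line, "%63s %ld", name, value) == 2;
}
```

For the line "HPM_MEM_LATENCY 400" the helper stores the name HPM_MEM_LATENCY and the value 400; a line with a name but no value is rejected.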
On hpmcount the following environment variables can also be used:
- HPM_AGGREGATE_COUNTERS
  - Used to aggregate counts on POE applications (forces the command line argument “-a”).
  - With this flag, a single performance file is generated for all tasks.
  - This flag only works with POE and/or LoadLeveler.
  - This flag requires the availability of a parallel file system (e.g., GPFS) on the system.
  - Notice: the program may hang if a parallel file system is not available and this flag is set.
- HPM_LOG_DIR <directory>
  - When this flag is set, hpmcount will write a file hpm_log.<id> with the performance data in the provided directory. This is in addition to the regular output.
  - On POE applications, <id> is a POE ID, provided by MP_PARTITION. Otherwise, it is the pid.
On libhpm the following environment variables can also be used:
- HPM_NUM_INST_PTS
  - Used to override the default of 100 instrumentation sections in the application.
  - Integer value > 0.
- HPM_WITH_MEASUREMENT_ERROR
  - Used to deactivate the procedure that attempts to remove measurement errors.
  - True or False (1 or 0).
- HPM_VIZ_OUTPUT
  - To indicate if the “.viz” file (for input to PeekPerf) should be generated or not.
  - True or False (1 or 0).
- HPM_OUTPUT_NAME
  - To define an output file name different from the default.
  - String.
On Power 3 systems, users can also specify an event set with the file libHPMevents. This file takes precedence over the environment variable. Each line in the file specifies one event from the hardware counters. Only one event from each counter can be used (the Power3 has 8 counters, the 604e has 4 counters). Each line should contain:
- Counter number (e.g., from 0 to 7 on the Power3)
- Event number (e.g., from 0 to 15 for counter 7 on the Power3)
- Mnemonic (e.g., PM_FPU0_CMPL#)
- Description (e.g., FPU 0 instructions#)
libHPMevents example:
3 1 PM_CYC# Cycles#
4 5 PM_FPU0_CMPL# FPU 0 instructions#
1 35 PM_FPU1_CMPL# FPU 1 instructions#
0 5 PM_IC_MISS# I cache misses#
2 5 PM_LD_MISS_L1# Load misses in L1#
7 0 PM_TLB_MISS# TLB misses#
5 5 PM_CBR_DISP# Branches#
6 3 PM_MPRED_BR# Misspredicted branches#
There are some consistency checks for this file, but in general, the user is expected to know enough about the hardware counters to create and use this file.
PeekPerf is an extension to hpmviz. It takes as input the performance files (“.viz”) generated by libhpm. If the performance files are not provided on the command line, PeekPerf will display a dialog box for user input. Users can select a single file by left clicking on a file name, or multiple files by using the <Shift> and/or <Ctrl> keys. The <Shift> key allows the selection of a range of files (from the last one selected to the current selection), while the <Ctrl> key allows the selection of multiple files in any order.
Usage:
> peekperf [<performance files>]
or for installations with hpmviz:
The main window of the PeekPerf graphical user interface is divided into two panes. The left pane displays, for each instrumented section identified by its label, the inclusive duration (i.e., the total wall clock time executing the corresponding code region), the exclusive duration (i.e., the wall clock time of the instrumented code region, excluding the time from inner instrumented regions), and the count. The instrumented sections are sorted by “Label”. Left clicking on any of the column tabs will sort the data in the corresponding column. The first click will sort in ascending order, while the second will sort in descending order.
Right clicking on an instrumentation section brings up a “metrics” window displaying the node ID, thread ID, count, exclusive duration, inclusive duration, and the derived hardware metrics. This window can be closed by typing “<Ctrl>W” or by clicking the “Close” button. There are also two menu options in the metrics window: Metrics Options and Precision. The “Metrics Options” menu brings up a metrics list that allows the user to select the metrics to be displayed. Clicking on the top of this list will turn it into an “X Windows” dialog box. The “Precision” menu allows the user to indicate to PeekPerf the precision used when running the program (double or single). Some values in the metrics display may be highlighted in red, indicating that the metric value is below a threshold value in a predefined range of average values for the metric. Similarly, a number in light gray indicates that the metric value is above a threshold value in a predefined range of average values for the metric. Notice that some of the predefined ranges depend on the precision used in the program. The default precision assumed is "double", but the user can change it to "single" with the menu option described above. Any of the columns in the metrics display can be sorted by clicking the corresponding tab. The first click will sort the values in ascending order, while the second will sort in descending order.
Left clicking on an instrumentation section in the main window brings up the corresponding section of the source code in the right pane, highlighted. If the corresponding source file is not available in the directory where PeekPerf is being executed, a dialog box will be displayed so the user can select the source file. At the top of the source code pane, there is a set of tabs, one for each instrumented module. The user can select a module to be displayed by clicking on the corresponding tab.
The “File” menu options provided in the main window allow one to open a new set of performance files, close the current data, close all data, or quit PeekPerf. The “open data”, “close data”, and “quit” operations can also be selected with the keys <Ctrl>O, <Ctrl>C, and <Ctrl>Q, respectively. The “open” command will bring up the dialog box for the selection of the performance files.
Hpmstat is a utility for system-wide hardware performance monitoring. It requires “root” privilege (but it can be used by non-root users when the set-user-ID and/or set-group-ID bit is set). When activated without command line parameters, hpmstat counts user and kernel activity for 60 seconds and presents the raw counts and derived metrics on stdout.
> hpmstat [-o <filename>] [-n] [-k] [-u] [-I|-U <Interval>] [-C <Count>] [-s <set>] [-g <group>] [-e ev[,ev]*]
or:
> hpmstat [-h][-c][-l]
where:
-I <Interval> indicates the counting time interval in seconds (default is 1 second).
-U <Interval> indicates the counting time interval in microseconds.
-C <Count> number of iterations to count (default is 1 for “-I” and infinity for “-U”).
-k overrides the default to count system activity only.
-u overrides the default to count user activity only.
-h displays a help message.
-c lists events from all counters.
-l lists all groups (on POWER 4 systems).
-o <filename> generates an output file: <filename>.<pid>. On parallel programs, this flag creates one file for each process. By default, the output goes to stdout.
-n for "no output to stdout". This flag is only active when the -o flag is used.
-e ev0,ev1,... (POWER 3 only) list of event numbers, separated by commas. ev<i> corresponds to the event selected for counter <i>. The event number can be obtained by running hpmcount -c (on Power3 systems).
-g <group> (POWER 4 only). Valid groups are from 0 to 60. The description of groups is available in /usr/pmapi/lib/POWER4.gps. The default group is 60. Groups considered interesting for application performance analysis are:
- 60, for counts of cycles, instructions, and FP operations (including divides, FMA, loads, and stores).
- 56, for counts of cycles, instructions, TLB misses, loads, stores, and L1 misses.
- 5, for counts of loads from L2, L3, and memory.
- 58, for counts of cycles, instructions, loads from L3, and loads from memory.
- 53, for counts of cycles, instructions, fixed-point operations, and FP operations (includes divides, SQRT, FMA, and FMOV or FEST).
-s predefined set of events. On Power 4 systems, -s is the same as -g. On Power 3 systems, the available sets are:
| Event set 1 (def.) | Event set 2 | Event set 3 | Event set 4 | Event set 5 | Event set 6 |
| Cycles | Cycles | Cycles | Cycles | Cycles | Cycles |
| Inst. completed | Inst. completed | Loads dispatched | Inst. completed | Inst. completed | Inst. dispatched |
| TLB misses | TLB misses | L1 load misses | Cycles w/ 0 inst. comp. | I cache misses | Inst. completed |
| Stores completed | Stores dispatched | L2 misses | Stores completed | Branches predicted | I cache misses |
| Loads completed | L1 store misses | Stores dispatched | Loads completed | Branches completed | I cache hits |
| FPU0 ops | Loads dispatched | L2 store misses | FXU0 ops | Cond. branches | Cond. branches |
| FPU1 ops | L1 load misses | Number of write back | FXU1 ops | Br. Misspred. | Br. dispatched |
| FMAs executed | LSU idle | LSU idle | FXU2 ops | TLB misses | TLB misses |
Only the root user can
activate hpmstat.
Hpmstat uses the PMAPI system-level API, as opposed to libhpm and hpmcount, which use the thread-level API. Because the
system-level APIs would report bogus data if the thread-level API is in use,
system-level API calls are not allowed at the same time as thread-level API
calls. Thus, the allocation of a thread context will take the system-level API
lock, which will not be released until the last context has been deallocated. Hence, hpmstat
counts will not be accurate if a program instrumented with libhpm or hpmcount is activated within the window of time that hpmstat is active.
In addition to presenting the raw counter data, HPM also computes derived metrics, depending on the hardware events that were selected to be counted. The following derived metrics are supported (please notice that not all metrics are supported on all systems):
- Total time in user mode (User time):
    User time = Cycles / Processor frequency
- Utilization rate:
    User time / Wall clock time
- Instructions per cycle (IPC):
    Instructions completed / Cycles
- MIPS:
    0.000001 * Instructions completed / Wall clock time
- Instructions per I Cache Miss:
    Instructions completed / Instructions cache misses
- Percentage of instructions dispatched that completed:
    100 * Instructions completed / Instructions dispatched
- Load and store operations (Total LS):
    Total LS = Loads + Stores
- Instructions per load/store:
    Instructions completed / Total LS
- Average number of loads per load miss:
    Loads / Load misses in L1
- Average number of L1 load misses per L2 load miss:
    Load misses in L1 / Load misses in L2
- Average number of stores per store miss:
    Stores / Store misses in L1
- Average number of loads per TLB miss:
    Loads / TLB misses
- Average number of load/stores per TLB miss:
    Total LS / TLB misses
- Average number of load/stores per L1 miss:
    Total LS / (Load misses in L1 + Store misses in L1)
- Average number of load/stores per L2 miss:
    Total LS / (Load misses in L2 + Store misses in L2)
- L1 cache hit rate:
    100 * ( 1 - ( (Load misses in L1 + Store misses in L1) / Total LS ) )
- L2 cache hit rate:
    100 * ( 1 - ( (Load misses in L2 + Store misses in L2) / Total L1 Misses ) )
- Memory traffic (MBytes):
    Power3: (L2 misses + Write backs) * Cache Line Size / (1024 * 1024)
    Power4: Data loaded from memory * 128 / (1024 * 1024)
- Memory bandwidth (MBytes/s):
    Memory traffic / Wall clock time
- Snoop hit ratio:
    100 * Snoop hit occurred / Snoop requests
- Hardware floating point instructions per cycle:
    ( FPU 0 + FPU 1 ) / Cycles
- Hardware floating point instructions / user time:
    ( FPU 0 + FPU 1 ) / User time
- Floating point instructions plus FMA ( flip ):
    Power3: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed
    Power4: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed - FPU Stores
- Floating point instructions plus FMAs rate (Mflip/sec):
    0.000001 * flip / Wall clock time
- Floating point instructions plus FMAs rate per user time:
    0.000001 * flip / User time
- Weighted floating point instructions (wflip):
    wflip = flip + (HPM_DIV_WEIGHT - 1) * Divides
- Weighted floating point instructions rate (M Wflip/s):
    M Wflip/s = 0.000001 * wflip / Wall clock time
- Computation intensity:
    flip / Total LS
- FMA percentage:
    100 * FMAs executed * 2 / flip
- Fixed point instructions:
    FXU 0 instructions + FXU 1 Instructions + FXU 2 Instructions
- Fixed point operations per Cycle:
    Fixed point instructions / Cycles
- Fixed point operations per Load Stores:
    Fixed point instructions / Total LS
- Branches mispredicted percentage:
    100 * Branches mispredicted / Branches
- Percentage of TLB misses per cycle:
    100 * TLB Misses / Cycles
- Estimated latency from TLB misses:
    User estimated TLB Miss latency * TLB Misses / Processor frequency
- Power 4 specific metrics (note that latencies are obtained via user input with environment flags):
  - Percentage of loads from memory per cycle:
      100 * Total loads from memory / Cycles
  - Estimated latency from loads from memory:
      Memory latency * loads from memory / Processor frequency
  - Total loads from L3 (L3 loads):
      Data loaded from L3 + Data loaded from L3.5
  - L3 traffic:
      L3 loads * Cache Line Size / (1024 * 1024)
  - L3 bandwidth:
      L3 traffic / Wall clock time
  - L3 load miss rate:
      Total loads from memory / (total loads from L3 + total loads from memory)
  - Percentage of L3 loads per cycle:
      100 * L3 loads / Cycles
  - Estimated latency from loads from L3:
      ( (L3 latency * L3 data loads) + (L3.5 latency * L3.5 data loads) ) / Processor frequency
      or
      Average L3 latency * Total loads from L3 / Processor frequency
  - Total loads from L2 (L2 loads):
      Sum (data loaded from (L2, L2.5(shared), L2.5(mod), L2.75(shared), and L2.75(mod))).
  - L2 traffic:
      L2 loads * Cache Line Size / (1024 * 1024)
  - L2 bandwidth:
      L2 traffic / Wall clock time
  - L2 load miss rate:
      (loads from memory + L3 loads) / (L2 loads + L3 loads + loads from memory)
  - Percentage of L2 loads per cycle:
      100 * L2 loads / Cycles
  - Estimated latency from loads from L2:
      ( (L2 lat. * L2 loads) + (L2.5 lat. * L2.5 loads) + (L2.75 lat. * L2.75 loads) ) / Processor frequency
      or
      Average L2 latency * Total loads from L2 / Processor frequency
  - Percentage of cycles LSU is idle:
      100 * LSU idle / Cycles
  - Percentage of cycles with zero instructions completed:
      100 * Zero instructions completed / Cycles
  - Average number of loads per L2 miss:
      Loads / Master generated load op not retried
  - Average number of stores per L2 miss:
      Stores / Master generated store op not retried
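A few of the formulas above can be written out in C as follows. These helpers are an illustrative sketch only: the function names are hypothetical and not part of HPM, the counts are passed as doubles for simplicity, and the wall clock time is assumed to be in seconds.

```c
#include <assert.h>

/* Illustrative helpers for some of the derived metrics listed above.
 * The names are hypothetical; HPM computes these values internally. */

/* Instructions completed / Cycles */
double ipc(double inst_completed, double cycles)
{
    assert(cycles > 0.0);
    return inst_completed / cycles;
}

/* Power3: flip = FPU 0 instructions + FPU 1 instructions + FMAs executed */
double flip_power3(double fpu0, double fpu1, double fma)
{
    return fpu0 + fpu1 + fma;
}

/* wflip = flip + (HPM_DIV_WEIGHT - 1) * Divides */
double wflip(double flip, double div_weight, double divides)
{
    return flip + (div_weight - 1.0) * divides;
}

/* 0.000001 * flip / Wall clock time (wall clock time in seconds) */
double mflip_per_sec(double flip, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return 0.000001 * flip / wall_seconds;
}
```

For example, 500 instructions completed in 1000 cycles give 0.5 instructions per cycle, and a flip count of 175 with a divide weight of 5 and 10 divides gives a weighted count of 215.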