
Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mispredicted, without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events has
passed, and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per-task and per-CPU counters, counter
groups, and event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

   int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
                             pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.
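
As a rough illustration of the call above, a user-space wrapper could go
through syscall(2). This is only a sketch: the wrapper name
perf_counter_open() and the reliance on a __NR_perf_counter_open constant
are assumptions, and the wrapper simply fails with ENOSYS when the installed
headers do not provide such a number.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct perf_counter_hw_event;	/* layout as described in this document */

/*
 * Hypothetical user-space wrapper; not part of any library.
 * Returns the new counter fd, or -1 with errno set.
 */
static int perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
			     pid_t pid, int cpu, int group_fd)
{
	/* The text only describes cpu >= 0 and cpu == -1. */
	assert(cpu >= -1);

#ifdef __NR_perf_counter_open
	return (int)syscall(__NR_perf_counter_open, hw_event_uptr,
			    pid, cpu, group_fd);
#else
	/* Assumption: no syscall number is available, so report ENOSYS. */
	(void)hw_event_uptr;
	(void)pid;
	(void)group_fd;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would check for a negative return value and inspect errno before
passing the descriptor to read(), fcntl() or poll().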

Multiple counters can be kept open at a time, and the counters
can be poll()ed.
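
Because a counter fd behaves like an ordinary file descriptor, the standard
fcntl(2), poll(2) and read(2) calls apply to it. The sketch below shows one
possible way to switch a descriptor to non-blocking mode, wait for it to
become readable, and fetch an 8-byte value. The helper names are made up for
illustration, and the code works on any fd, not only on counters.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: put fd into non-blocking mode. Returns 0 or -1. */
static int counter_set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
static int counter_wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Hypothetical helper: try to read one 8-byte value without blocking.
 * Returns 1 if a value was read, 0 if none is available yet, -1 on error.
 */
static int counter_try_read(int fd, uint64_t *value)
{
	ssize_t n;

	assert(value != NULL);
	n = read(fd, value, sizeof(*value));
	if (n == (ssize_t)sizeof(*value))
		return 1;
	if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
		return 0;
	return -1;
}
```

With several counters open, the same pattern extends to an array of
struct pollfd entries, one per counter fd.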

When creating a new counter fd, the 'hw_event_uptr' argument points to a
'perf_counter_hw_event' structure:

/*
 * Hardware event to monitor via a performance monitoring counter:
 */
struct perf_counter_hw_event {
	s64			type;

	u64			irq_period;
	u32			record_type;

	u32			disabled     :  1, /* off by default */
				nmi          :  1, /* NMI sampling   */
				raw          :  1, /* raw event type */
				__reserved_1 : 29;

	u64			__reserved_2;
};

/*
 * Generalized performance counter event types, used by the hw_event.type
 * parameter of the sys_perf_counter_open() syscall:
 */
enum hw_event_types {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_CYCLES		=  0,
	PERF_COUNT_INSTRUCTIONS		=  1,
	PERF_COUNT_CACHE_REFERENCES	=  2,
	PERF_COUNT_CACHE_MISSES		=  3,
	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
	PERF_COUNT_BRANCH_MISSES	=  5,

	/*
	 * Special "software" counters provided by the kernel, even if
	 * the hardware does not support performance counters. These
	 * counters measure various physical and sw events of the
	 * kernel (and allow the profiling of them as well):
	 */
	PERF_COUNT_CPU_CLOCK		= -1,
	PERF_COUNT_TASK_CLOCK		= -2,
	/*
	 * Future software events:
	 */
	/* PERF_COUNT_PAGE_FAULTS	= -3,
	   PERF_COUNT_CONTEXT_SWITCHES	= -4, */
};

These are standardized types of events that work uniformly on all CPUs
that implement Performance Counters support under Linux. If a CPU is
not able to count a requested event (branch misses, for example), then
the system call will return -EINVAL.
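
To make the structure and the generalized event types more concrete, here is
a minimal sketch of how user space might fill in a request. It mirrors the
definitions above with <stdint.h> types; the helper names hw_event_init()
and hw_event_is_software() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure shown above, using <stdint.h> types. */
struct perf_counter_hw_event {
	int64_t  type;
	uint64_t irq_period;
	uint32_t record_type;
	uint32_t disabled     :  1,
		 nmi          :  1,
		 raw          :  1,
		 __reserved_1 : 29;
	uint64_t __reserved_2;
};

/* Same values as the enum in the text (future events left out). */
enum hw_event_types {
	PERF_COUNT_CYCLES		=  0,
	PERF_COUNT_INSTRUCTIONS		=  1,
	PERF_COUNT_CACHE_REFERENCES	=  2,
	PERF_COUNT_CACHE_MISSES		=  3,
	PERF_COUNT_BRANCH_INSTRUCTIONS	=  4,
	PERF_COUNT_BRANCH_MISSES	=  5,
	PERF_COUNT_CPU_CLOCK		= -1,
	PERF_COUNT_TASK_CLOCK		= -2,
};

/* Hypothetical helper: describe a generalized event, all else zeroed. */
static void hw_event_init(struct perf_counter_hw_event *ev, int64_t type)
{
	assert(ev != NULL);
	memset(ev, 0, sizeof(*ev));	/* also clears the reserved fields */
	ev->type = type;
}

/* Generalized software events use negative type values in the enum. */
static int hw_event_is_software(const struct perf_counter_hw_event *ev)
{
	return !ev->raw && ev->type < 0;
}
```

A caller that receives -EINVAL for a hardware event could, for instance,
retry with one of the software clocks, which the text says are available
even without hardware counter support.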

More hw_event_types are supported as well, but they are CPU-specific
and are enumerated via /sys on a per-CPU basis. Raw hw event
types can be passed in under hw_event.type if hw_event.raw is 1.
For example, to count "External bus cycles while bus lock signal asserted"
events on Intel Core CPUs, pass in a 0x4064 event type value and set
hw_event.raw to 1.
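
The raw-event case can be sketched in the same way. The structure is again
a local mirror using <stdint.h> types, and both the helper
hw_event_init_raw() and the macro name RAW_BUS_LOCK_CYCLES are invented for
this example; only the 0x4064 value and the 'raw' bit come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the structure from this document (stdint types). */
struct perf_counter_hw_event {
	int64_t  type;
	uint64_t irq_period;
	uint32_t record_type;
	uint32_t disabled : 1, nmi : 1, raw : 1, __reserved_1 : 29;
	uint64_t __reserved_2;
};

/* Event code quoted in the text for Intel Core CPUs. */
#define RAW_BUS_LOCK_CYCLES	0x4064

/* Hypothetical helper: describe a raw, CPU-specific event. */
static void hw_event_init_raw(struct perf_counter_hw_event *ev,
			      int64_t raw_code)
{
	assert(ev != NULL);
	memset(ev, 0, sizeof(*ev));
	ev->type = raw_code;	/* CPU-specific code, not an enum value */
	ev->raw = 1;		/* marks 'type' as a raw event code */
}
```

Calling hw_event_init_raw(&ev, RAW_BUS_LOCK_CYCLES) yields a request that
is only meaningful on the CPU family the code was taken from.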

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

/*
 * IRQ-notification data record type:
 */
enum perf_counter_record_type {
	PERF_RECORD_SIMPLE		=  0,
	PERF_RECORD_IRQ			=  1,
	PERF_RECORD_GROUP		=  2,
};

A "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
a successful read of a simple counter.)

An "irq" counter is one that also provides IRQ context information:
the instruction pointer (IP) of the interrupted context. In this case
read() will return the 8-byte counter value, plus the IP address of the
interrupted context.
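
The "irq" record described above could be consumed along the following
lines. Note the assumptions: the text does not spell out the exact layout,
so this sketch assumes the 8-byte count comes first and is followed by a
64-bit instruction pointer. The struct irq_record type and the
read_irq_record() helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Assumed layout of a PERF_RECORD_IRQ read: count first, then the IP. */
struct irq_record {
	uint64_t count;
	uint64_t ip;
};

/*
 * Hypothetical helper: read one record from a counter fd.
 * Returns 0 on success, -1 on error or on a short read.
 */
static int read_irq_record(int fd, struct irq_record *rec)
{
	unsigned char buf[sizeof(rec->count) + sizeof(rec->ip)];
	ssize_t n;

	assert(rec != NULL);
	n = read(fd, buf, sizeof(buf));
	if (n != (ssize_t)sizeof(buf))
		return -1;

	/* Copy field by field so no struct padding rules are relied on. */
	memcpy(&rec->count, buf, sizeof(rec->count));
	memcpy(&rec->ip, buf + sizeof(rec->count), sizeof(rec->ip));
	return 0;
}
```

On a platform with 32-bit instruction pointers the record size would
presumably differ, so the buffer size here should be treated as a
placeholder.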

The 'irq_period' field is the number of events before waking up
a read() that is blocked on a counter fd. A value of zero means a
non-blocking counter.

The 'pid' parameter allows the counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per-CPU counters)

The 'cpu' parameter allows a counter to be made specific to a full
CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per-task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per-task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per-CPU counter that counts
all events on CPU-x. Per-CPU counters need CAP_SYS_ADMIN privilege.
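
The pid/cpu rules above can be summarized in a small classification routine.
This is a sketch only: the enum and function names are hypothetical, any
negative pid is treated like -1, and a cpu value below -1 is rejected as an
assumption, since the text does not describe it.

```c
#include <assert.h>
#include <sys/types.h>

enum counter_scope {
	SCOPE_INVALID,		/* e.g. pid == -1 together with cpu == -1 */
	SCOPE_TASK,		/* one task, followed across all CPUs */
	SCOPE_TASK_ON_CPU,	/* one task, counted only on one CPU */
	SCOPE_CPU,		/* all tasks on one CPU */
};

/* Hypothetical helper: map a (pid, cpu) pair to the kind of counter. */
static enum counter_scope classify_scope(pid_t pid, int cpu)
{
	if (cpu < -1)
		return SCOPE_INVALID;	/* assumption: not covered by the text */
	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_CPU;
	/* pid == 0 (current task) and pid > 0 are both per-task counters. */
	return cpu == -1 ? SCOPE_TASK : SCOPE_TASK_ON_CPU;
}

/* Per-CPU counters are the ones the text says need CAP_SYS_ADMIN. */
static int scope_needs_admin(enum counter_scope scope)
{
	assert(scope != SCOPE_INVALID);
	return scope == SCOPE_CPU;
}
```

A tool could run such a check before issuing the system call, to report an
invalid combination or a missing privilege with a clearer message than a
bare error code.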

Group counters are created by passing in the fd of another counter as
group_fd. The counters of a group are scheduled together and can be used
with PERF_RECORD_GROUP to record multi-dimensional timestamps.
