Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hardware events,
such as instructions executed, cache misses suffered, or branches
mispredicted, without slowing down the kernel or applications. These
registers can also trigger interrupts when a threshold number of events
has passed, and can thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, counter
groups, and it provides event capabilities on top of those.  It
provides "virtual" 64-bit counters, regardless of the width of the
underlying hardware counters.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_event_open()
system call:

	int sys_perf_event_open(struct perf_event_attr *hw_event_uptr,
				pid_t pid, int cpu, int group_fd,
				unsigned long flags);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'perf_event_attr' is:

struct perf_event_attr {
	/*
	 * The MSB of the config word signifies if the rest contains cpu
	 * specific (raw) counter configuration data, if unset, the next
	 * 7 bits are an event type and the rest of the bits are the event
	 * identifier.
	 */
	__u64			config;

	__u64			irq_period;
	__u32			record_type;
	__u32			read_format;

	__u64			disabled       :  1, /* off by default        */
				inherit        :  1, /* children inherit it   */
				pinned         :  1, /* must always be on PMU */
				exclusive      :  1, /* only group on PMU     */
				exclude_user   :  1, /* don't count user      */
				exclude_kernel :  1, /* ditto kernel          */
				exclude_hv     :  1, /* ditto hypervisor      */
				exclude_idle   :  1, /* don't count when idle */
				mmap           :  1, /* include mmap data     */
				munmap         :  1, /* include munmap data   */
				comm           :  1, /* include comm data     */

				__reserved_1   : 53;

	__u32			extra_config_len;
	__u32			wakeup_events;	/* wakeup every n events */

	__u64			__reserved_2;
	__u64			__reserved_3;
};

The 'config' field specifies what the counter should count.  It
is divided into 3 bit-fields:

raw_type: 1 bit   (most significant bit)	0x8000_0000_0000_0000
type:	  7 bits  (next most significant)	0x7f00_0000_0000_0000
event_id: 56 bits (least significant)		0x00ff_ffff_ffff_ffff

If 'raw_type' is 1, then the counter will count a hardware event
specified by the remaining 63 bits of 'config'.  The encoding is
machine-specific.

If 'raw_type' is 0, then the 'type' field says what kind of counter
this is, with the following encoding:

enum perf_event_types {
	PERF_TYPE_HARDWARE		= 0,
	PERF_TYPE_SOFTWARE		= 1,
	PERF_TYPE_TRACEPOINT		= 2,
};

A counter of PERF_TYPE_HARDWARE will count the hardware event
specified by 'event_id':

/*
 * Generalized performance counter event types, used by the hw_event.event_id
 * parameter of the sys_perf_event_open() syscall:
 */
enum hw_event_ids {
	/*
	 * Common hardware events, generalized by the kernel:
	 */
	PERF_COUNT_HW_CPU_CYCLES		= 0,
	PERF_COUNT_HW_INSTRUCTIONS		= 1,
	PERF_COUNT_HW_CACHE_REFERENCES		= 2,
	PERF_COUNT_HW_CACHE_MISSES		= 3,
	PERF_COUNT_HW_BRANCH_INSTRUCTIONS	= 4,
	PERF_COUNT_HW_BRANCH_MISSES		= 5,
	PERF_COUNT_HW_BUS_CYCLES		= 6,
};

These are standardized types of events that work relatively uniformly
on all CPUs that implement Performance Counters support under Linux,
although there may be variations (e.g., different CPUs might count
cache references and misses at different levels of the cache hierarchy).
If a CPU is not able to count the selected event, then the system call
will return -EINVAL.

More hardware event types are supported as well, but they are CPU-specific
and accessed as raw events.  For example, to count "External bus
cycles while bus lock signal asserted" events on Intel Core CPUs, pass
in a 0x4064 event_id value and set hw_event.raw_type to 1.

A counter of type PERF_TYPE_SOFTWARE will count one of the available
software events, selected by 'event_id':

/*
 * Special "software" counters provided by the kernel, even if the hardware
 * does not support performance counters. These counters measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum sw_event_ids {
	PERF_COUNT_SW_CPU_CLOCK		= 0,
	PERF_COUNT_SW_TASK_CLOCK	= 1,
	PERF_COUNT_SW_PAGE_FAULTS	= 2,
	PERF_COUNT_SW_CONTEXT_SWITCHES	= 3,
	PERF_COUNT_SW_CPU_MIGRATIONS	= 4,
	PERF_COUNT_SW_PAGE_FAULTS_MIN	= 5,
	PERF_COUNT_SW_PAGE_FAULTS_MAJ	= 6,
	PERF_COUNT_SW_ALIGNMENT_FAULTS	= 7,
	PERF_COUNT_SW_EMULATION_FAULTS	= 8,
};

Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace event
tracer is available, and event_id values can be obtained from
/debug/tracing/events/*/*/id


Counters come in two flavours: counting counters and sampling
counters.  A "counting" counter is one that is used for counting the
number of events that occur, and is characterised by having
irq_period = 0.


A read() on a counter returns the current value of the counter and possibly
additional values as specified by 'read_format'; each value is a u64 (8 bytes)
in size.

/*
 * Bits that can be set in hw_event.read_format to request that
 * reads on the counter should return the indicated quantities,
 * in increasing order of bit value, after the counter value.
 */
enum perf_event_read_format {
	PERF_FORMAT_TOTAL_TIME_ENABLED	=  1,
	PERF_FORMAT_TOTAL_TIME_RUNNING	=  2,
};

Using these additional values one can establish the overcommit ratio for a
particular counter allowing one to take the round-robin scheduling effect
into account.


A "sampling" counter is one that is set up to generate an interrupt
every N events, where N is given by 'irq_period'.  A sampling counter
has irq_period > 0. The record_type controls what data is recorded on each
interrupt:

/*
 * Bits that can be set in hw_event.record_type to request information
 * in the overflow packets.
 */
enum perf_event_record_format {
	PERF_RECORD_IP		= 1U << 0,
	PERF_RECORD_TID		= 1U << 1,
	PERF_RECORD_TIME	= 1U << 2,
	PERF_RECORD_ADDR	= 1U << 3,
	PERF_RECORD_GROUP	= 1U << 4,
	PERF_RECORD_CALLCHAIN	= 1U << 5,
};

Such (and other) events will be recorded in a ring-buffer, which is
available to user-space using mmap() (see below).

The 'disabled' bit specifies whether the counter starts out disabled
or enabled.  If it is initially disabled, it can be enabled by ioctl
or prctl (see below).

The 'inherit' bit, if set, specifies that this counter should count
events on descendant tasks as well as the task specified.  This only
applies to new descendants, not to any existing descendants at the
time the counter is created (nor to any new descendants of existing
descendants).

The 'pinned' bit, if set, specifies that the counter should always be
on the CPU if at all possible.  It only applies to hardware counters
and only to group leaders.  If a pinned counter cannot be put onto the
CPU (e.g. because there are not enough hardware counters or because of
a conflict with some other event), then the counter goes into an
'error' state, where reads return end-of-file (i.e. read() returns 0)
until the counter is subsequently enabled or disabled.

The 'exclusive' bit, if set, specifies that when this counter's group
is on the CPU, it should be the only group using the CPU's counters.
In future, this will allow sophisticated monitoring programs to supply
extra configuration information via 'extra_config_len' to exploit
advanced features of the CPU's Performance Monitor Unit (PMU) that are
not otherwise accessible and that might disrupt other hardware
counters.

The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
way to request that counting of events be restricted to times when the
CPU is in user, kernel and/or hypervisor mode.

The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
operations.  These can be used to relate userspace IP addresses to actual
code, even after the mapping (or even the whole process) is gone;
these events are recorded in the ring-buffer (see below).

The 'comm' bit allows tracking of process comm data on process creation.
This too is recorded in the ring-buffer (see below).

The 'pid' parameter to the perf_event_open() system call allows the
counter to be specific to a task:

 pid == 0: if the pid parameter is zero, the counter is attached to the
 current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.

The 'flags' parameter is currently unused and must be zero.

The 'group_fd' parameter allows counter "groups" to be set up.  A
counter group has one counter which is the group "leader".  The leader
is created first, with group_fd = -1 in the perf_event_open call
that creates it.  The rest of the group members are created
subsequently, with group_fd giving the fd of the group leader.
(A single counter on its own is created with group_fd = -1 and is
considered to be a group with only 1 member.)

A counter group is scheduled onto the CPU as a unit, that is, it will
only be put onto the CPU if all of the counters in the group can be
put onto the CPU.  This means that the values of the member counters
can be meaningfully compared, added, divided (to get ratios), etc.,
with each other, since they have counted events for the same set of
executed instructions.


As noted, asynchronous events, like counter overflow or PROT_EXEC mmap
tracking, are logged into a ring-buffer. This ring-buffer is created and
accessed through mmap().

The mmap size should be 1+2^n pages, where the first page is a meta-data page
(struct perf_event_mmap_page) that contains various bits of information such
as where the ring-buffer head is.

/*
 * Structure of the page that can be mapped via mmap
 */
struct perf_event_mmap_page {
	__u32	version;		/* version number of this structure */
	__u32	compat_version;		/* lowest version this is compat with */

	/*
	 * Bits needed to read the hw counters in user-space.
	 *
	 *   u32 seq;
	 *   s64 count;
	 *
	 *   do {
	 *     seq = pc->lock;
	 *
	 *     barrier()
	 *     if (pc->index) {
	 *       count = pmc_read(pc->index - 1);
	 *       count += pc->offset;
	 *     } else
	 *       goto regular_read;
	 *
	 *     barrier();
	 *   } while (pc->lock != seq);
	 *
	 * NOTE: for obvious reason this only works on self-monitoring
	 *       processes.
	 */
	__u32	lock;			/* seqlock for synchronization */
	__u32	index;			/* hardware counter identifier */
	__s64	offset;			/* add to hardware counter value */

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space reading this value should issue an rmb(), on SMP capable
	 * platforms, after reading this value -- see perf_event_wakeup().
	 */
	__u32	data_head;		/* head in the data section */
};

NOTE: the hw-counter userspace bits are arch specific and are currently only
implemented on powerpc.

The following 2^n pages are the ring-buffer which contains events of the form:

#define PERF_RECORD_MISC_KERNEL		(1 << 0)
#define PERF_RECORD_MISC_USER		(1 << 1)
#define PERF_RECORD_MISC_OVERFLOW	(1 << 2)

struct perf_event_header {
	__u32	type;
	__u16	misc;
	__u16	size;
};

enum perf_event_type {

	/*
	 * The MMAP events record the PROT_EXEC mappings so that we can
	 * correlate userspace IPs to code. They have the following structure:
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	u64				addr;
	 *	u64				len;
	 *	u64				pgoff;
	 *	char				filename[];
	 * };
	 */
	PERF_RECORD_MMAP		= 1,
	PERF_RECORD_MUNMAP		= 2,

	/*
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	u32				pid, tid;
	 *	char				comm[];
	 * };
	 */
	PERF_RECORD_COMM		= 3,

	/*
	 * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field
	 * will be PERF_RECORD_*
	 *
	 * struct {
	 *	struct perf_event_header	header;
	 *
	 *	{ u64			ip;	  } && PERF_RECORD_IP
	 *	{ u32			pid, tid; } && PERF_RECORD_TID
	 *	{ u64			time;	  } && PERF_RECORD_TIME
	 *	{ u64			addr;	  } && PERF_RECORD_ADDR
	 *
	 *	{ u64			nr;
	 *	  { u64 event, val; }	cnt[nr];  } && PERF_RECORD_GROUP
	 *
	 *	{ u16			nr,
	 *				hv,
	 *				kernel,
	 *				user;
	 *	  u64			ips[nr];  } && PERF_RECORD_CALLCHAIN
	 * };
	 */
};

NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented
on x86.


Notification of new events is possible through poll()/select()/epoll(), and
through signals set up via fcntl().

Normally a notification is generated for every page filled, however one can
additionally set perf_event_attr.wakeup_events to generate one every
so many counter overflow events.

Future work will include a splice() interface to the ring-buffer.


Counters can be enabled and disabled in two ways: via ioctl and via
prctl.  When a counter is disabled, it doesn't count or generate
events but does continue to exist and maintain its count value.

An individual counter or counter group can be enabled with

	ioctl(fd, PERF_EVENT_IOC_ENABLE);

or disabled with

	ioctl(fd, PERF_EVENT_IOC_DISABLE);

Enabling or disabling the leader of a group enables or disables the
whole group; that is, while the group leader is disabled, none of the
counters in the group will count.  Enabling or disabling a member of a
group other than the leader only affects that counter - disabling a
non-leader stops that counter from counting but doesn't affect any
other counter.

Additionally, non-inherited overflow counters can use

	ioctl(fd, PERF_EVENT_IOC_REFRESH, nr);

to enable a counter for 'nr' events, after which it gets disabled again.

A process can enable or disable all the counter groups that are
attached to it, using prctl:

	prctl(PR_TASK_PERF_EVENTS_ENABLE);

	prctl(PR_TASK_PERF_EVENTS_DISABLE);

This applies to all counters on the current process, whether created
by this process or by another, and doesn't affect any counters that
this process has created on other processes.  It only enables or
disables the group leaders, not any other members in the groups.


Arch requirements
-----------------

If your architecture does not have hardware performance metrics, you can
still use the generic software counters based on hrtimers for sampling.

So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you
will need at least this:
	- asm/perf_event.h - a basic stub will suffice at first
	- support for atomic64 types (and associated helper functions)
	- set_perf_event_pending() implemented

If your architecture does have hardware capabilities, you can override the
weak stub hw_perf_event_init() to register hardware counters.

Architectures that have d-cache aliasing issues, such as Sparc and ARM,
should select PERF_USE_VMALLOC in order to avoid these for perf mmap().