relayfs - a high-speed data relay filesystem
============================================

relayfs is a filesystem designed to provide an efficient mechanism for
tools and facilities to relay large and potentially sustained streams
of data from kernel space to user space.

The main abstraction of relayfs is the 'channel'.  A channel consists
of a set of per-cpu kernel buffers each represented by a file in the
relayfs filesystem.  Kernel clients write into a channel using
efficient write functions which automatically log to the current cpu's
channel buffer.  User space applications mmap() the per-cpu files and
retrieve the data as it becomes available.

The format of the data logged into the channel buffers is completely
up to the relayfs client; relayfs imposes no format of its own, though
it does provide hooks which allow clients to impose some structure on
the buffer data.  Nor does relayfs implement any form of data
filtering - this also is left to the client.  The purpose is to keep
relayfs as simple as possible.

This document provides an overview of the relayfs API.  The details of
the function parameters are documented along with the functions in the
filesystem code - please see that for details.

Semantics
=========

Each relayfs channel has one buffer per CPU, and each buffer has one
or more sub-buffers.  Messages are written to the first sub-buffer
until it is too full to contain a new message, in which case the
message is written to the next sub-buffer (if available).  Messages
are never split across sub-buffers.  At this point, userspace can be
notified so it empties the first sub-buffer, while the kernel
continues writing to the next.

When a sub-buffer becomes full, the kernel knows how many bytes of it
are padding i.e. unused.  Userspace can use this knowledge to copy
only valid data.

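The fill-and-switch behavior described above can be modelled in a few
lines of ordinary user space C.  This is only a sketch for
illustration, not relayfs code: struct model_buf, model_write() and
the two sizes are made-up names and values, and unlike a real channel
buffer the model does not wrap around.

```c
#include <stddef.h>
#include <string.h>

#define SUBBUF_SIZE 16	/* hypothetical sub-buffer size in bytes */
#define N_SUBBUFS   4	/* hypothetical number of sub-buffers */

struct model_buf {
	char   data[N_SUBBUFS][SUBBUF_SIZE];
	size_t padding[N_SUBBUFS];	/* unused bytes at the end of each filled sub-buffer */
	size_t cur;			/* index of the sub-buffer being written */
	size_t offset;			/* write offset within the current sub-buffer */
};

/*
 * Append one message.  A message is never split: if it does not fit
 * in the current sub-buffer, the remainder of that sub-buffer is
 * recorded as padding and the message goes to the next one.
 * Returns 0 on success, -1 if the message cannot be stored.
 */
static int model_write(struct model_buf *b, const void *msg, size_t len)
{
	if (len > SUBBUF_SIZE)
		return -1;	/* can never fit in any sub-buffer */

	if (b->offset + len > SUBBUF_SIZE) {
		if (b->cur + 1 == N_SUBBUFS)
			return -1;	/* no next sub-buffer available */
		b->padding[b->cur] = SUBBUF_SIZE - b->offset;
		b->cur++;
		b->offset = 0;
	}

	memcpy(&b->data[b->cur][b->offset], msg, len);
	b->offset += len;
	return 0;
}
```

With these sizes, writing two 10-byte messages into a zeroed
model_buf leaves the first message in sub-buffer 0 with 6 bytes of
padding and places the second at the start of sub-buffer 1.
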
After copying it, userspace can notify the kernel that a sub-buffer
has been consumed.

relayfs can operate in a mode where it will overwrite data not yet
collected by userspace, rather than wait for userspace to consume it.

relayfs itself does not provide for communication of such data (which
sub-buffers are full or consumed, and how much padding they contain)
between userspace and kernel, allowing the kernel side to remain
simple and not impose a single interface on userspace.  It does
provide a separate helper though, described below.

klog, relay-app & librelay
==========================

relayfs itself is ready to use, but to make things easier, two
additional systems are provided.  klog is a simple wrapper to make
writing formatted text or raw data to a channel simpler, regardless of
whether a channel to write into exists or not, or whether relayfs is
compiled into the kernel or is configured as a module.  relay-app is
the kernel counterpart of userspace librelay.c; combined, these two
files provide glue to easily stream data to disk, without having to
bother with housekeeping.  klog and relay-app can be used together,
with klog providing high-level logging functions to the kernel and
relay-app taking care of kernel-user control and disk-logging chores.

It is possible to use relayfs without relay-app & librelay, but you'll
have to implement communication between userspace and kernel, allowing
both to convey the state of buffers (full, empty, amount of padding).

klog, relay-app and librelay can be found in the relay-apps tarball on
http://relayfs.sourceforge.net

The relayfs user space API
==========================

relayfs implements basic file operations for user space access to
relayfs channel buffer data.  Here are the file operations that are
available and some comments regarding their behavior:

open()	 enables the user to open an _existing_ buffer.

mmap()	 results in the channel buffer being mapped into the caller's
	 memory space.  Note that you can't do a partial mmap - you must
	 map the entire file, which is NRBUF * SUBBUFSIZE bytes, i.e.
	 the number of sub-buffers times the sub-buffer size.

read()	 reads the contents of a channel buffer.  The bytes read are
	 'consumed' by the reader i.e. they won't be available again
	 to subsequent reads.  If the channel is being used in
	 no-overwrite mode (the default), it can be read at any time
	 even if there's an active kernel writer.  If the channel is
	 being used in overwrite mode and there are active channel
	 writers, results may be unpredictable - users should make
	 sure that all logging to the channel has ended before using
	 read() with overwrite mode.

poll()	 POLLIN/POLLRDNORM/POLLERR supported.  User applications are
	 notified when sub-buffer boundaries are crossed.

close()	 decrements the channel buffer's refcount.  When the refcount
	 reaches 0 i.e. when no process or kernel client has the buffer
	 open, the channel buffer is freed.

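As a rough illustration of the poll() and read() operations listed
above, a user space consumer might look something like the following.
The function name drain_once(), the 4096-byte chunk size and any file
name passed to it (for example a hypothetical /mnt/relay/mychan0) are
assumptions made for this sketch; it also assumes that read() returns
0 once no more data is currently available.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Copy whatever one per-cpu relayfs file currently has to offer into
 * out_fd.  Waits up to timeout_ms for the file to become readable.
 * Returns the number of bytes copied (possibly 0), or -1 on error.
 */
static long drain_once(const char *path, int out_fd, int timeout_ms)
{
	char chunk[4096];
	struct pollfd pfd;
	long total = 0;
	ssize_t n;

	pfd.fd = open(path, O_RDONLY);	/* the buffer file must already exist */
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* Wait for data to become readable, then read until none is left */
	if (poll(&pfd, 1, timeout_ms) > 0 && (pfd.revents & POLLIN)) {
		while ((n = read(pfd.fd, chunk, sizeof(chunk))) > 0) {
			if (write(out_fd, chunk, (size_t)n) != n) {
				total = -1;	/* short or failed write */
				break;
			}
			total += n;
		}
	}

	close(pfd.fd);
	return total;
}
```

A logging daemon would typically call this in a loop, once for each
per-cpu file of the channel, appending to one output file per cpu.
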
In order for a user application to make use of relayfs files, the
relayfs filesystem must be mounted.  For example,

	mount -t relayfs relayfs /mnt/relay

NOTE:	relayfs doesn't need to be mounted for kernel clients to create
	or use channels - it only needs to be mounted when user space
	applications need access to the buffer data.

The relayfs kernel API
======================

Here's a summary of the API relayfs provides to in-kernel clients:


  channel management functions:

    relay_open(base_filename, parent, subbuf_size, n_subbufs,
               callbacks)
    relay_close(chan)
    relay_flush(chan)
    relay_reset(chan)
    relayfs_create_dir(name, parent)
    relayfs_remove_dir(dentry)

  channel management typically called on instigation of userspace:

    relay_subbufs_consumed(chan, cpu, subbufs_consumed)

  write functions:

    relay_write(chan, data, length)
    __relay_write(chan, data, length)
    relay_reserve(chan, length)

  callbacks:

    subbuf_start(buf, subbuf, prev_subbuf, prev_padding)
    buf_mapped(buf, filp)
    buf_unmapped(buf, filp)

  helper functions:

    relay_buf_full(buf)
    subbuf_start_reserve(buf, length)

Creating a channel
------------------

relay_open() is used to create a channel, along with its per-cpu
channel buffers.  Each channel buffer will have an associated file
created for it in the relayfs filesystem, which can be opened and
mmapped from user space if desired.  The files are named
basename0...basenameN-1 where N is the number of online cpus, and by
default will be created in the root of the filesystem.  If you want a
directory structure to contain your relayfs files, you can create it
with relayfs_create_dir() and pass the parent directory to
relay_open().  Clients are responsible for cleaning up any directory
structure they create when the channel is closed - use
relayfs_remove_dir() for that.

The total size of each per-cpu buffer is calculated by multiplying the
number of sub-buffers by the sub-buffer size passed into relay_open().
The idea behind sub-buffers is that they're basically an extension of
double-buffering to N buffers, and they also allow applications to
easily implement random-access-on-buffer-boundary schemes, which can
be important for some high-volume applications.  The number and size
of sub-buffers are completely dependent on the application and even
for the same application, different conditions will warrant different
values for these parameters at different times.  Typically, the right
values to use are best decided after some experimentation; in general,
though, it's safe to assume that having only 1 sub-buffer is a bad
idea - you're guaranteed to either overwrite data or lose events
depending on the channel mode being used.

Channel 'modes'
---------------

relayfs channels can be used in either of two modes - 'overwrite' or
'no-overwrite'.  The mode is entirely determined by the implementation
of the subbuf_start() callback, as described below.  In 'overwrite'
mode, also known as 'flight recorder' mode, writes continuously cycle
around the buffer and will never fail, but will unconditionally
overwrite old data regardless of whether it's actually been consumed.
In no-overwrite mode, writes will fail i.e. data will be lost, if the
number of unconsumed sub-buffers equals the total number of
sub-buffers in the channel.  It should be clear that if there is no
consumer or if the consumer can't consume sub-buffers fast enough,
data will be lost in either case; the only difference is whether data
is lost from the beginning or the end of a buffer.

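The difference between the two modes when no consumer is running can
be illustrated with a small user space model.  This is not relayfs
code: struct mode_model and mode_model_log() are hypothetical, and
each sub-buffer is reduced to a single event id to keep the sketch
short.

```c
#include <stddef.h>

#define MODEL_SUBBUFS 3	/* hypothetical number of sub-buffers */

struct mode_model {
	int    slot[MODEL_SUBBUFS];	/* one event id stands in for a whole sub-buffer */
	size_t produced;		/* sub-buffers filled so far */
	size_t consumed;		/* sub-buffers released by the consumer */
	int    overwrite;		/* nonzero selects 'overwrite' mode */
};

/* Returns 1 if the event was stored, 0 if it was dropped. */
static int mode_model_log(struct mode_model *m, int event)
{
	/* Full means every sub-buffer holds data nobody has consumed yet */
	int full = (m->produced - m->consumed) >= MODEL_SUBBUFS;

	if (full && !m->overwrite)
		return 0;	/* no-overwrite: the newest data is lost */

	/* overwrite: the oldest slot is reused, so the oldest data is lost */
	m->slot[m->produced % MODEL_SUBBUFS] = event;
	m->produced++;
	return 1;
}
```

Logging events 1 to 5 with no consumer leaves 1, 2, 3 in the slots in
no-overwrite mode (events 4 and 5 are dropped) and 4, 5, 3 in
overwrite mode (events 1 and 2 are overwritten).
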
As explained above, a relayfs channel is made up of one or more
per-cpu channel buffers, each implemented as a circular buffer
subdivided into one or more sub-buffers.  Messages are written into
the current sub-buffer of the channel's current per-cpu buffer via the
write functions described below.  Whenever a message can't fit into
the current sub-buffer, because there's no room left for it, the
client is notified via the subbuf_start() callback that a switch to a
new sub-buffer is about to occur.  The client uses this callback to 1)
initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually go ahead with the sub-buffer switch.

To implement 'no-overwrite' mode, the kernel client would provide an
implementation of the subbuf_start() callback something like the
following:

static int subbuf_start(struct rchan_buf *buf,
			void *subbuf,
			void *prev_subbuf,
			unsigned int prev_padding)
{
	/* Store the padding count in the header of the sub-buffer just filled */
	if (prev_subbuf)
		*((unsigned int *)prev_subbuf) = prev_padding;

	/* Every sub-buffer is still unconsumed: refuse the switch */
	if (relay_buf_full(buf))
		return 0;

	/* Reserve header room in the new sub-buffer for its own padding count */
	subbuf_start_reserve(buf, sizeof(unsigned int));

	return 1;
}

If the current buffer is full i.e. all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
occur yet i.e. until the consumer has had a chance to read the current
set of ready sub-buffers.  For the relay_buf_full() function to make
sense, the consumer is responsible for notifying relayfs when
sub-buffers have been consumed via relay_subbufs_consumed().  Any
subsequent attempts to write into the buffer will again invoke the
subbuf_start() callback with the same parameters; only when the
consumer has consumed one or more of the ready sub-buffers will
relay_buf_full() return 0, in which case the buffer switch can
continue.

The implementation of the subbuf_start() callback for 'overwrite' mode
would be very similar:

static int subbuf_start(struct rchan_buf *buf,
			void *subbuf,
			void *prev_subbuf,
			unsigned int prev_padding)
{
	/* Store the padding count in the header of the sub-buffer just filled */
	if (prev_subbuf)
		*((unsigned int *)prev_subbuf) = prev_padding;

	/* Reserve header room in the new sub-buffer for its own padding count */
	subbuf_start_reserve(buf, sizeof(unsigned int));

	/* Always switch, even if the next sub-buffer hasn't been consumed */
	return 1;
}

In this case, the relay_buf_full() check is meaningless and the
callback always returns 1, causing the buffer switch to occur
unconditionally.  It's also meaningless for the client to use the
relay_subbufs_consumed() function in this mode, as it's never
consulted.

The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
implements the simplest possible 'no-overwrite' mode i.e. it does
nothing but return 0.

Header information can be reserved at the beginning of each sub-buffer
by calling the subbuf_start_reserve() helper function from within the
subbuf_start() callback.  This reserved area can be used to store
whatever information the client wants.  In the example above, room is
reserved in each sub-buffer to store the padding count for that
sub-buffer.  This is filled in for the previous sub-buffer in the
subbuf_start() implementation; the padding value for the previous
sub-buffer is passed into the subbuf_start() callback along with a
pointer to the previous sub-buffer, since the padding value isn't
known until a sub-buffer is filled.  The subbuf_start() callback is
also called for the first sub-buffer when the channel is opened, to
give the client a chance to reserve space in it.  In this case the
previous sub-buffer pointer passed into the callback will be NULL, so
the client should check the value of the prev_subbuf pointer before
writing into the previous sub-buffer.

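On the consumer side, the padding count stored by the example callback
can be used to find the valid data in a full sub-buffer.  The helper
below is a user space sketch under two assumptions that are not
spelled out above: that the padding count occupies the first
sizeof(unsigned int) bytes of each sub-buffer (as in the example
callback), and that it counts only the unused bytes at the end of the
sub-buffer.  subbuf_payload() is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Locate the valid payload of one full sub-buffer whose first
 * sizeof(unsigned int) bytes hold its padding count.
 * On success *payload points just past the header and the payload
 * length is returned.  If the header is implausible, *payload is set
 * to NULL and 0 is returned.
 */
static size_t subbuf_payload(const void *subbuf, size_t subbuf_size,
			     const void **payload)
{
	unsigned int padding;

	*payload = NULL;
	if (subbuf_size < sizeof(padding))
		return 0;

	/* memcpy avoids making assumptions about alignment */
	memcpy(&padding, subbuf, sizeof(padding));
	if (padding > subbuf_size - sizeof(padding))
		return 0;	/* padding larger than the space after the header */

	*payload = (const char *)subbuf + sizeof(padding);
	return subbuf_size - sizeof(padding) - padding;
}
```

A consumer that has mapped the whole per-cpu file could call this on
each ready sub-buffer, at offset i * subbuf_size for sub-buffer i, and
write out only the returned number of bytes.
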
Writing to a channel
--------------------

Kernel clients write data into the current cpu's channel buffer using
relay_write() or __relay_write().  relay_write() is the main logging
function - it uses local_irq_save() to protect the buffer and should
be used if you might be logging from interrupt context.  If you know
you'll never be logging from interrupt context, you can use
__relay_write(), which only disables preemption.  These functions
don't return a value, so you can't determine whether or not they
failed - the assumption is that you wouldn't want to check a return
value in the fast logging path anyway, and that they'll always succeed
unless the buffer is full and no-overwrite mode is being used, in
which case you can detect a failed write in the subbuf_start()
callback by calling the relay_buf_full() helper function.

relay_reserve() is used to reserve a slot in a channel buffer which
can be written to later.  This would typically be used in applications
that need to write directly into a channel buffer without having to
stage data in a temporary buffer beforehand.  Because the actual write
may not happen immediately after the slot is reserved, applications
using relay_reserve() can keep a count of the number of bytes actually
written, either in space reserved in the sub-buffers themselves or as
a separate array.  See the 'reserve' example in the relay-apps tarball
at http://relayfs.sourceforge.net for an example of how this can be
done.  Because the write is under control of the client and is
separated from the reserve, relay_reserve() doesn't protect the buffer
at all - it's up to the client to provide the appropriate
synchronization when using relay_reserve().

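The reserve-then-fill pattern, and the bookkeeping of bytes actually
written, can be sketched in user space as follows.  This models the
idea only and is not the relayfs implementation: struct reserve_model,
model_reserve() and model_fill() are hypothetical names, there is a
single flat buffer rather than a set of sub-buffers, and no
synchronization is shown.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_BUF_SIZE 64	/* hypothetical buffer size in bytes */

struct reserve_model {
	char   data[MODEL_BUF_SIZE];
	size_t reserved;	/* bytes handed out as slots so far */
	size_t committed;	/* bytes actually written, tracked by the client */
};

/* Hand out a slot of len bytes, or NULL if there is no room left. */
static void *model_reserve(struct reserve_model *b, size_t len)
{
	void *slot;

	if (len > MODEL_BUF_SIZE - b->reserved)
		return NULL;
	slot = &b->data[b->reserved];
	b->reserved += len;
	return slot;
}

/* Fill a previously reserved slot and account for the bytes written. */
static void model_fill(struct reserve_model *b, void *slot,
		       const void *src, size_t len)
{
	memcpy(slot, src, len);
	b->committed += len;
}
```

Between the two calls, reserved is ahead of committed; a reader that
looked only at reserved would see bytes that haven't been written yet,
which is why the count of bytes actually written is kept separately.
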
Closing a channel
-----------------

The client calls relay_close() when it's finished using the channel.
The channel and its associated buffers are destroyed when there are no
longer any references to any of the channel buffers.  relay_flush()
forces a sub-buffer switch on all the channel buffers, and can be used
to finalize and process the last sub-buffers before the channel is
closed.

Misc
----

Some applications may want to keep a channel around and re-use it
rather than open and close a new channel for each use.  relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings.  It should however only be called when it's safe to
do so i.e. when the channel isn't currently being written to.

Finally, there are a couple of utility callbacks that can be used for
different purposes.  buf_mapped() is called whenever a channel buffer
is mmapped from user space and buf_unmapped() is called when it's
unmapped.  The client can use this notification to trigger actions
within the kernel application, such as enabling/disabling logging to
the channel.

Resources
=========

For news, example code, mailing list, etc. see the relayfs homepage:

    http://relayfs.sourceforge.net


Credits
=======

The ideas and specs for relayfs came about as a result of discussions
on tracing involving the following:

Michel Dagenais		<michel.dagenais@polymtl.ca>
Richard Moore		<richardj_moore@uk.ibm.com>
Bob Wisniewski		<bob@watson.ibm.com>
Karim Yaghmour		<karim@opersys.com>
Tom Zanussi		<zanussi@us.ibm.com>

Also thanks to Hubertus Franke for a lot of useful suggestions and bug
reports.