|  |  | 
|  | relayfs - a high-speed data relay filesystem | 
|  | ============================================ | 
|  |  | 
|  | relayfs is a filesystem designed to provide an efficient mechanism for | 
|  | tools and facilities to relay large and potentially sustained streams | 
|  | of data from kernel space to user space. | 
|  |  | 
|  | The main abstraction of relayfs is the 'channel'.  A channel consists | 
|  | of a set of per-cpu kernel buffers each represented by a file in the | 
|  | relayfs filesystem.  Kernel clients write into a channel using | 
|  | efficient write functions which automatically log to the current cpu's | 
|  | channel buffer.  User space applications mmap() the per-cpu files and | 
|  | retrieve the data as it becomes available. | 
|  |  | 
|  | The format of the data logged into the channel buffers is completely | 
|  | up to the relayfs client; relayfs does however provide hooks which | 
|  | allow clients to impose some structure on the buffer data.  Nor does | 
|  | relayfs implement any form of data filtering - this also is left to | 
|  | the client.  The purpose is to keep relayfs as simple as possible. | 
|  |  | 
|  | This document provides an overview of the relayfs API.  The details of | 
|  | the function parameters are documented along with the functions in the | 
|  | filesystem code - please see that for details. | 
|  |  | 
|  | Semantics | 
|  | ========= | 
|  |  | 
|  | Each relayfs channel has one buffer per CPU, each buffer has one or | 
|  | more sub-buffers. Messages are written to the first sub-buffer until | 
|  | it is too full to contain a new message, in which case it it is | 
|  | written to the next (if available).  Messages are never split across | 
|  | sub-buffers.  At this point, userspace can be notified so it empties | 
|  | the first sub-buffer, while the kernel continues writing to the next. | 
|  |  | 
|  | When notified that a sub-buffer is full, the kernel knows how many | 
|  | bytes of it are padding i.e. unused.  Userspace can use this knowledge | 
|  | to copy only valid data. | 
|  |  | 
|  | After copying it, userspace can notify the kernel that a sub-buffer | 
|  | has been consumed. | 
|  |  | 
|  | relayfs can operate in a mode where it will overwrite data not yet | 
|  | collected by userspace, and not wait for it to consume it. | 
|  |  | 
|  | relayfs itself does not provide for communication of such data between | 
|  | userspace and kernel, allowing the kernel side to remain simple and | 
|  | not impose a single interface on userspace. It does provide a set of | 
|  | examples and a separate helper though, described below. | 
|  |  | 
|  | klog and relay-apps example code | 
|  | ================================ | 
|  |  | 
|  | relayfs itself is ready to use, but to make things easier, a couple | 
|  | simple utility functions and a set of examples are provided. | 
|  |  | 
|  | The relay-apps example tarball, available on the relayfs sourceforge | 
|  | site, contains a set of self-contained examples, each consisting of a | 
|  | pair of .c files containing boilerplate code for each of the user and | 
|  | kernel sides of a relayfs application; combined these two sets of | 
|  | boilerplate code provide glue to easily stream data to disk, without | 
|  | having to bother with mundane housekeeping chores. | 
|  |  | 
|  | The 'klog debugging functions' patch (klog.patch in the relay-apps | 
|  | tarball) provides a couple of high-level logging functions to the | 
|  | kernel which allow writing formatted text or raw data to a channel, | 
|  | regardless of whether a channel to write into exists or not, or | 
|  | whether relayfs is compiled into the kernel or is configured as a | 
|  | module.  These functions allow you to put unconditional 'trace' | 
|  | statements anywhere in the kernel or kernel modules; only when there | 
|  | is a 'klog handler' registered will data actually be logged (see the | 
|  | klog and kleak examples for details). | 
|  |  | 
|  | It is of course possible to use relayfs from scratch i.e. without | 
|  | using any of the relay-apps example code or klog, but you'll have to | 
|  | implement communication between userspace and kernel, allowing both to | 
|  | convey the state of buffers (full, empty, amount of padding). | 
|  |  | 
|  | klog and the relay-apps examples can be found in the relay-apps | 
|  | tarball on http://relayfs.sourceforge.net | 
|  |  | 
|  |  | 
|  | The relayfs user space API | 
|  | ========================== | 
|  |  | 
|  | relayfs implements basic file operations for user space access to | 
|  | relayfs channel buffer data.  Here are the file operations that are | 
|  | available and some comments regarding their behavior: | 
|  |  | 
|  | open()	 enables user to open an _existing_ buffer. | 
|  |  | 
|  | mmap()	 results in channel buffer being mapped into the caller's | 
|  | memory space. Note that you can't do a partial mmap - you must | 
|  | map the entire file, which is NRBUF * SUBBUFSIZE. | 
|  |  | 
|  | read()	 read the contents of a channel buffer.  The bytes read are | 
|  | 'consumed' by the reader i.e. they won't be available again | 
|  | to subsequent reads.  If the channel is being used in | 
|  | no-overwrite mode (the default), it can be read at any time | 
|  | even if there's an active kernel writer.  If the channel is | 
|  | being used in overwrite mode and there are active channel | 
|  | writers, results may be unpredictable - users should make | 
|  | sure that all logging to the channel has ended before using | 
|  | read() with overwrite mode. | 
|  |  | 
|  | poll()	 POLLIN/POLLRDNORM/POLLERR supported.  User applications are | 
|  | notified when sub-buffer boundaries are crossed. | 
|  |  | 
|  | close() decrements the channel buffer's refcount.  When the refcount | 
|  | reaches 0 i.e. when no process or kernel client has the buffer | 
|  | open, the channel buffer is freed. | 
|  |  | 
|  |  | 
|  | In order for a user application to make use of relayfs files, the | 
|  | relayfs filesystem must be mounted.  For example, | 
|  |  | 
|  | mount -t relayfs relayfs /mnt/relay | 
|  |  | 
|  | NOTE:	relayfs doesn't need to be mounted for kernel clients to create | 
|  | or use channels - it only needs to be mounted when user space | 
|  | applications need access to the buffer data. | 
|  |  | 
|  |  | 
|  | The relayfs kernel API | 
|  | ====================== | 
|  |  | 
|  | Here's a summary of the API relayfs provides to in-kernel clients: | 
|  |  | 
|  |  | 
|  | channel management functions: | 
|  |  | 
|  | relay_open(base_filename, parent, subbuf_size, n_subbufs, | 
|  | callbacks) | 
|  | relay_close(chan) | 
|  | relay_flush(chan) | 
|  | relay_reset(chan) | 
|  | relayfs_create_dir(name, parent) | 
|  | relayfs_remove_dir(dentry) | 
|  | relayfs_create_file(name, parent, mode, fops, data) | 
|  | relayfs_remove_file(dentry) | 
|  |  | 
|  | channel management typically called on instigation of userspace: | 
|  |  | 
|  | relay_subbufs_consumed(chan, cpu, subbufs_consumed) | 
|  |  | 
|  | write functions: | 
|  |  | 
|  | relay_write(chan, data, length) | 
|  | __relay_write(chan, data, length) | 
|  | relay_reserve(chan, length) | 
|  |  | 
|  | callbacks: | 
|  |  | 
|  | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | 
|  | buf_mapped(buf, filp) | 
|  | buf_unmapped(buf, filp) | 
|  | create_buf_file(filename, parent, mode, buf, is_global) | 
|  | remove_buf_file(dentry) | 
|  |  | 
|  | helper functions: | 
|  |  | 
|  | relay_buf_full(buf) | 
|  | subbuf_start_reserve(buf, length) | 
|  |  | 
|  |  | 
|  | Creating a channel | 
|  | ------------------ | 
|  |  | 
|  | relay_open() is used to create a channel, along with its per-cpu | 
|  | channel buffers.  Each channel buffer will have an associated file | 
|  | created for it in the relayfs filesystem, which can be opened and | 
|  | mmapped from user space if desired.  The files are named | 
|  | basename0...basenameN-1 where N is the number of online cpus, and by | 
|  | default will be created in the root of the filesystem.  If you want a | 
|  | directory structure to contain your relayfs files, you can create it | 
|  | with relayfs_create_dir() and pass the parent directory to | 
|  | relay_open().  Clients are responsible for cleaning up any directory | 
|  | structure they create when the channel is closed - use | 
|  | relayfs_remove_dir() for that. | 
|  |  | 
|  | The total size of each per-cpu buffer is calculated by multiplying the | 
|  | number of sub-buffers by the sub-buffer size passed into relay_open(). | 
|  | The idea behind sub-buffers is that they're basically an extension of | 
|  | double-buffering to N buffers, and they also allow applications to | 
|  | easily implement random-access-on-buffer-boundary schemes, which can | 
|  | be important for some high-volume applications.  The number and size | 
|  | of sub-buffers is completely dependent on the application and even for | 
|  | the same application, different conditions will warrant different | 
|  | values for these parameters at different times.  Typically, the right | 
|  | values to use are best decided after some experimentation; in general, | 
|  | though, it's safe to assume that having only 1 sub-buffer is a bad | 
|  | idea - you're guaranteed to either overwrite data or lose events | 
|  | depending on the channel mode being used. | 
|  |  | 
|  | Channel 'modes' | 
|  | --------------- | 
|  |  | 
|  | relayfs channels can be used in either of two modes - 'overwrite' or | 
|  | 'no-overwrite'.  The mode is entirely determined by the implementation | 
|  | of the subbuf_start() callback, as described below.  In 'overwrite' | 
|  | mode, also known as 'flight recorder' mode, writes continuously cycle | 
|  | around the buffer and will never fail, but will unconditionally | 
|  | overwrite old data regardless of whether it's actually been consumed. | 
|  | In no-overwrite mode, writes will fail i.e. data will be lost, if the | 
|  | number of unconsumed sub-buffers equals the total number of | 
|  | sub-buffers in the channel.  It should be clear that if there is no | 
|  | consumer or if the consumer can't consume sub-buffers fast enought, | 
|  | data will be lost in either case; the only difference is whether data | 
|  | is lost from the beginning or the end of a buffer. | 
|  |  | 
|  | As explained above, a relayfs channel is made of up one or more | 
|  | per-cpu channel buffers, each implemented as a circular buffer | 
|  | subdivided into one or more sub-buffers.  Messages are written into | 
|  | the current sub-buffer of the channel's current per-cpu buffer via the | 
|  | write functions described below.  Whenever a message can't fit into | 
|  | the current sub-buffer, because there's no room left for it, the | 
|  | client is notified via the subbuf_start() callback that a switch to a | 
|  | new sub-buffer is about to occur.  The client uses this callback to 1) | 
|  | initialize the next sub-buffer if appropriate 2) finalize the previous | 
|  | sub-buffer if appropriate and 3) return a boolean value indicating | 
|  | whether or not to actually go ahead with the sub-buffer switch. | 
|  |  | 
|  | To implement 'no-overwrite' mode, the userspace client would provide | 
|  | an implementation of the subbuf_start() callback something like the | 
|  | following: | 
|  |  | 
|  | static int subbuf_start(struct rchan_buf *buf, | 
|  | void *subbuf, | 
|  | void *prev_subbuf, | 
|  | unsigned int prev_padding) | 
|  | { | 
|  | if (prev_subbuf) | 
|  | *((unsigned *)prev_subbuf) = prev_padding; | 
|  |  | 
|  | if (relay_buf_full(buf)) | 
|  | return 0; | 
|  |  | 
|  | subbuf_start_reserve(buf, sizeof(unsigned int)); | 
|  |  | 
|  | return 1; | 
|  | } | 
|  |  | 
|  | If the current buffer is full i.e. all sub-buffers remain unconsumed, | 
|  | the callback returns 0 to indicate that the buffer switch should not | 
|  | occur yet i.e. until the consumer has had a chance to read the current | 
|  | set of ready sub-buffers.  For the relay_buf_full() function to make | 
|  | sense, the consumer is reponsible for notifying relayfs when | 
|  | sub-buffers have been consumed via relay_subbufs_consumed().  Any | 
|  | subsequent attempts to write into the buffer will again invoke the | 
|  | subbuf_start() callback with the same parameters; only when the | 
|  | consumer has consumed one or more of the ready sub-buffers will | 
|  | relay_buf_full() return 0, in which case the buffer switch can | 
|  | continue. | 
|  |  | 
|  | The implementation of the subbuf_start() callback for 'overwrite' mode | 
|  | would be very similar: | 
|  |  | 
|  | static int subbuf_start(struct rchan_buf *buf, | 
|  | void *subbuf, | 
|  | void *prev_subbuf, | 
|  | unsigned int prev_padding) | 
|  | { | 
|  | if (prev_subbuf) | 
|  | *((unsigned *)prev_subbuf) = prev_padding; | 
|  |  | 
|  | subbuf_start_reserve(buf, sizeof(unsigned int)); | 
|  |  | 
|  | return 1; | 
|  | } | 
|  |  | 
|  | In this case, the relay_buf_full() check is meaningless and the | 
|  | callback always returns 1, causing the buffer switch to occur | 
|  | unconditionally.  It's also meaningless for the client to use the | 
|  | relay_subbufs_consumed() function in this mode, as it's never | 
|  | consulted. | 
|  |  | 
|  | The default subbuf_start() implementation, used if the client doesn't | 
|  | define any callbacks, or doesn't define the subbuf_start() callback, | 
|  | implements the simplest possible 'no-overwrite' mode i.e. it does | 
|  | nothing but return 0. | 
|  |  | 
|  | Header information can be reserved at the beginning of each sub-buffer | 
|  | by calling the subbuf_start_reserve() helper function from within the | 
|  | subbuf_start() callback.  This reserved area can be used to store | 
|  | whatever information the client wants.  In the example above, room is | 
|  | reserved in each sub-buffer to store the padding count for that | 
|  | sub-buffer.  This is filled in for the previous sub-buffer in the | 
|  | subbuf_start() implementation; the padding value for the previous | 
|  | sub-buffer is passed into the subbuf_start() callback along with a | 
|  | pointer to the previous sub-buffer, since the padding value isn't | 
|  | known until a sub-buffer is filled.  The subbuf_start() callback is | 
|  | also called for the first sub-buffer when the channel is opened, to | 
|  | give the client a chance to reserve space in it.  In this case the | 
|  | previous sub-buffer pointer passed into the callback will be NULL, so | 
|  | the client should check the value of the prev_subbuf pointer before | 
|  | writing into the previous sub-buffer. | 
|  |  | 
|  | Writing to a channel | 
|  | -------------------- | 
|  |  | 
|  | kernel clients write data into the current cpu's channel buffer using | 
|  | relay_write() or __relay_write().  relay_write() is the main logging | 
|  | function - it uses local_irqsave() to protect the buffer and should be | 
|  | used if you might be logging from interrupt context.  If you know | 
|  | you'll never be logging from interrupt context, you can use | 
|  | __relay_write(), which only disables preemption.  These functions | 
|  | don't return a value, so you can't determine whether or not they | 
|  | failed - the assumption is that you wouldn't want to check a return | 
|  | value in the fast logging path anyway, and that they'll always succeed | 
|  | unless the buffer is full and no-overwrite mode is being used, in | 
|  | which case you can detect a failed write in the subbuf_start() | 
|  | callback by calling the relay_buf_full() helper function. | 
|  |  | 
|  | relay_reserve() is used to reserve a slot in a channel buffer which | 
|  | can be written to later.  This would typically be used in applications | 
|  | that need to write directly into a channel buffer without having to | 
|  | stage data in a temporary buffer beforehand.  Because the actual write | 
|  | may not happen immediately after the slot is reserved, applications | 
|  | using relay_reserve() can keep a count of the number of bytes actually | 
|  | written, either in space reserved in the sub-buffers themselves or as | 
|  | a separate array.  See the 'reserve' example in the relay-apps tarball | 
|  | at http://relayfs.sourceforge.net for an example of how this can be | 
|  | done.  Because the write is under control of the client and is | 
|  | separated from the reserve, relay_reserve() doesn't protect the buffer | 
|  | at all - it's up to the client to provide the appropriate | 
|  | synchronization when using relay_reserve(). | 
|  |  | 
|  | Closing a channel | 
|  | ----------------- | 
|  |  | 
|  | The client calls relay_close() when it's finished using the channel. | 
|  | The channel and its associated buffers are destroyed when there are no | 
|  | longer any references to any of the channel buffers.  relay_flush() | 
|  | forces a sub-buffer switch on all the channel buffers, and can be used | 
|  | to finalize and process the last sub-buffers before the channel is | 
|  | closed. | 
|  |  | 
|  | Creating non-relay files | 
|  | ------------------------ | 
|  |  | 
|  | relay_open() automatically creates files in the relayfs filesystem to | 
|  | represent the per-cpu kernel buffers; it's often useful for | 
|  | applications to be able to create their own files alongside the relay | 
|  | files in the relayfs filesystem as well e.g. 'control' files much like | 
|  | those created in /proc or debugfs for similar purposes, used to | 
|  | communicate control information between the kernel and user sides of a | 
|  | relayfs application.  For this purpose the relayfs_create_file() and | 
|  | relayfs_remove_file() API functions exist.  For relayfs_create_file(), | 
|  | the caller passes in a set of user-defined file operations to be used | 
|  | for the file and an optional void * to a user-specified data item, | 
|  | which will be accessible via inode->u.generic_ip (see the relay-apps | 
|  | tarball for examples).  The file_operations are a required parameter | 
|  | to relayfs_create_file() and thus the semantics of these files are | 
|  | completely defined by the caller. | 
|  |  | 
|  | See the relay-apps tarball at http://relayfs.sourceforge.net for | 
|  | examples of how these non-relay files are meant to be used. | 
|  |  | 
|  | Creating relay files in other filesystems | 
|  | ----------------------------------------- | 
|  |  | 
|  | By default of course, relay_open() creates relay files in the relayfs | 
|  | filesystem.  Because relay_file_operations is exported, however, it's | 
|  | also possible to create and use relay files in other pseudo-filesytems | 
|  | such as debugfs. | 
|  |  | 
|  | For this purpose, two callback functions are provided, | 
|  | create_buf_file() and remove_buf_file().  create_buf_file() is called | 
|  | once for each per-cpu buffer from relay_open() to allow the client to | 
|  | create a file to be used to represent the corresponding buffer; if | 
|  | this callback is not defined, the default implementation will create | 
|  | and return a file in the relayfs filesystem to represent the buffer. | 
|  | The callback should return the dentry of the file created to represent | 
|  | the relay buffer.  Note that the parent directory passed to | 
|  | relay_open() (and passed along to the callback), if specified, must | 
|  | exist in the same filesystem the new relay file is created in.  If | 
|  | create_buf_file() is defined, remove_buf_file() must also be defined; | 
|  | it's responsible for deleting the file(s) created in create_buf_file() | 
|  | and is called during relay_close(). | 
|  |  | 
|  | The create_buf_file() implementation can also be defined in such a way | 
|  | as to allow the creation of a single 'global' buffer instead of the | 
|  | default per-cpu set.  This can be useful for applications interested | 
|  | mainly in seeing the relative ordering of system-wide events without | 
|  | the need to bother with saving explicit timestamps for the purpose of | 
|  | merging/sorting per-cpu files in a postprocessing step. | 
|  |  | 
|  | To have relay_open() create a global buffer, the create_buf_file() | 
|  | implementation should set the value of the is_global outparam to a | 
|  | non-zero value in addition to creating the file that will be used to | 
|  | represent the single buffer.  In the case of a global buffer, | 
|  | create_buf_file() and remove_buf_file() will be called only once.  The | 
|  | normal channel-writing functions e.g. relay_write() can still be used | 
|  | - writes from any cpu will transparently end up in the global buffer - | 
|  | but since it is a global buffer, callers should make sure they use the | 
|  | proper locking for such a buffer, either by wrapping writes in a | 
|  | spinlock, or by copying a write function from relayfs_fs.h and | 
|  | creating a local version that internally does the proper locking. | 
|  |  | 
|  | See the 'exported-relayfile' examples in the relay-apps tarball for | 
|  | examples of creating and using relay files in debugfs. | 
|  |  | 
|  | Misc | 
|  | ---- | 
|  |  | 
|  | Some applications may want to keep a channel around and re-use it | 
|  | rather than open and close a new channel for each use.  relay_reset() | 
|  | can be used for this purpose - it resets a channel to its initial | 
|  | state without reallocating channel buffer memory or destroying | 
|  | existing mappings.  It should however only be called when it's safe to | 
|  | do so i.e. when the channel isn't currently being written to. | 
|  |  | 
|  | Finally, there are a couple of utility callbacks that can be used for | 
|  | different purposes.  buf_mapped() is called whenever a channel buffer | 
|  | is mmapped from user space and buf_unmapped() is called when it's | 
|  | unmapped.  The client can use this notification to trigger actions | 
|  | within the kernel application, such as enabling/disabling logging to | 
|  | the channel. | 
|  |  | 
|  |  | 
|  | Resources | 
|  | ========= | 
|  |  | 
|  | For news, example code, mailing list, etc. see the relayfs homepage: | 
|  |  | 
|  | http://relayfs.sourceforge.net | 
|  |  | 
|  |  | 
|  | Credits | 
|  | ======= | 
|  |  | 
|  | The ideas and specs for relayfs came about as a result of discussions | 
|  | on tracing involving the following: | 
|  |  | 
|  | Michel Dagenais		<michel.dagenais@polymtl.ca> | 
|  | Richard Moore		<richardj_moore@uk.ibm.com> | 
|  | Bob Wisniewski		<bob@watson.ibm.com> | 
|  | Karim Yaghmour		<karim@opersys.com> | 
|  | Tom Zanussi		<zanussi@us.ibm.com> | 
|  |  | 
|  | Also thanks to Hubertus Franke for a lot of useful suggestions and bug | 
|  | reports. |