| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 1 | -*-Mode: outline-*- | 
 | 2 |  | 
 | 3 | 		Light-weight System Calls for IA-64 | 
 | 4 | 		----------------------------------- | 
 | 5 |  | 
 | 6 | 		        Started: 13-Jan-2003 | 
 | 7 | 		    Last update: 27-Sep-2003 | 
 | 8 |  | 
 | 9 | 	              David Mosberger-Tang | 
 | 10 | 		      <davidm@hpl.hp.com> | 
 | 11 |  | 
 | 12 | Using the "epc" instruction effectively introduces a new mode of | 
 | 13 | execution to the ia64 linux kernel.  We call this mode the | 
 | 14 | "fsys-mode".  To recap, the normal states of execution are: | 
 | 15 |  | 
 | 16 |   - kernel mode: | 
 | 17 | 	Both the register stack and the memory stack have been | 
 | 18 | 	switched over to kernel memory.  The user-level state is saved | 
 | 19 | 	in a pt_regs structure at the top of the kernel memory stack. | 
 | 20 |  | 
 | 21 |   - user mode: | 
 | 22 | 	Both the register stack and the memory stack are in | 
 | 23 | 	user memory.  The user-level state is contained in the | 
 | 24 | 	CPU registers. | 
 | 25 |  | 
 | 26 |   - bank 0 interruption-handling mode: | 
 | 27 | 	This is the non-interruptible state which all | 
 | 28 | 	interruption-handlers start execution in.  The user-level | 
 | 29 | 	state remains in the CPU registers and some kernel state may | 
 | 30 | 	be stored in bank 0 of registers r16-r31. | 
 | 31 |  | 
 | 32 | In contrast, fsys-mode has the following special properties: | 
 | 33 |  | 
 | 34 |   - execution is at privilege level 0 (most-privileged) | 
 | 35 |  | 
 | 36 |   - CPU registers may contain a mixture of user-level and kernel-level | 
 | 37 |     state (it is the responsibility of the kernel to ensure that no | 
 | 38 |     security-sensitive kernel-level state is leaked back to | 
 | 39 |     user-level) | 
 | 40 |  | 
 | 41 |   - execution is interruptible and preemptible (an fsys-mode handler | 
 | 42 |     can prevent preemption by disabling interrupts and avoiding all | 
 | 43 |     other interruption sources) | 
 | 44 |  | 
 | 45 |   - neither the memory-stack nor the register-stack can be trusted while | 
 | 46 |     in fsys-mode (they point to the user-level stacks, which may | 
 | 47 |     be invalid, or completely bogus addresses) | 
 | 48 |  | 
 | 49 | In summary, fsys-mode is much more similar to running in user-mode | 
 | 50 | than it is to running in kernel-mode.  Of course, given that the | 
 | 51 | privilege level is at level 0, this means that fsys-mode requires some | 
 | 52 | care (see below). | 
 | 53 |  | 
 | 54 |  | 
 | 55 | * How to tell fsys-mode | 
 | 56 |  | 
 | 57 | Linux operates in fsys-mode when (a) the privilege level is 0 (most | 
 | 58 | privileged) and (b) the stacks have NOT been switched to kernel memory | 
 | 59 | yet.  For convenience, the header file <asm-ia64/ptrace.h> provides | 
 | 60 | three macros: | 
 | 61 |  | 
 | 62 | 	user_mode(regs) | 
 | 63 | 	user_stack(task,regs) | 
 | 64 | 	fsys_mode(task,regs) | 
 | 65 |  | 
 | 66 | The "regs" argument is a pointer to a pt_regs structure.  The "task" | 
 | 67 | argument is a pointer to the task structure to which the "regs" | 
 | 68 | pointer belongs.  user_mode() returns TRUE if the CPU state pointed | 
 | 69 | to by "regs" was executing in user mode (privilege level 3). | 
 | 70 | user_stack() returns TRUE if the state pointed to by "regs" was | 
 | 71 | executing on the user-level stack(s).  Finally, fsys_mode() returns | 
 | 72 | TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. | 
 | 73 | The fsys_mode() macro is equivalent to the expression: | 
 | 74 |  | 
 | 75 | 	!user_mode(regs) && user_stack(task,regs) | 
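
The relationship between the three macros can be modelled in plain C.
This is only an illustrative sketch: the model_regs and model_task
structures, the on_user_stack flag and the model_ prefix are
assumptions made for the example and do not reflect the real
definitions in <asm-ia64/ptrace.h>.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct pt_regs. */
struct model_regs {
	unsigned int cpl;	/* privilege level of the saved state: 0..3 */
	bool on_user_stack;	/* assumed flag: stacks not yet switched */
};

/* Hypothetical stand-in for the task structure (unused by this model). */
struct model_task {
	int pid;
};

/* Models user_mode(): TRUE if the saved state ran at privilege level 3. */
bool model_user_mode(const struct model_regs *regs)
{
	return regs->cpl == 3;
}

/* Models user_stack(): TRUE if the saved state ran on the user stacks. */
bool model_user_stack(const struct model_task *task,
		      const struct model_regs *regs)
{
	(void)task;		/* the real macro inspects the task as well */
	return regs->on_user_stack;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
bool model_fsys_mode(const struct model_task *task,
		     const struct model_regs *regs)
{
	return !model_user_mode(regs) && model_user_stack(task, regs);
}
```

In this model, a state with cpl 0 and on_user_stack set is reported as
fsys-mode, while cpl 3 (user mode) or cpl 0 on the kernel stacks
(kernel mode) is not.
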
 | 76 |  | 
 | 77 | * How to write an fsyscall handler | 
 | 78 |  | 
 | 79 | The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers | 
 | 80 | (fsyscall_table).  This table contains one entry for each system call. | 
 | 81 | By default, a system call is handled by fsys_fallback_syscall().  This | 
 | 82 | routine takes care of entering (full) kernel mode and calling the | 
 | 83 | normal Linux system call handler.  For performance-critical system | 
 | 84 | calls, it is possible to write a hand-tuned fsyscall_handler.  For | 
 | 85 | example, fsys.S contains fsys_getpid(), which is a hand-tuned version | 
 | 86 | of the getpid() system call. | 
 | 87 |  | 
 | 88 | The entry and exit state of an fsyscall handler are as follows: | 
 | 89 |  | 
 | 90 | ** Machine state on entry to fsyscall handler: | 
 | 91 |  | 
 | 92 |  - r10	  = 0 | 
 | 93 |  - r11	  = saved ar.pfs (a user-level value) | 
 | 94 |  - r15	  = system call number | 
 | 95 |  - r16	  = "current" task pointer (in normal kernel-mode, this is in r13) | 
 | 96 |  - r32-r39 = system call arguments | 
 | 97 |  - b6	  = return address (a user-level value) | 
 | 98 |  - ar.pfs = previous frame-state (a user-level value) | 
 | 99 |  - PSR.be = cleared to zero (i.e., little-endian byte order is in effect) | 
 | 100 |  - all other registers may contain values passed in from user-mode | 
 | 101 |  | 
 | 102 | ** Required machine state on exit from fsyscall handler: | 
 | 103 |  | 
 | 104 |  - r11	  = saved ar.pfs (as passed into the fsyscall handler) | 
 | 105 |  - r15	  = system call number (as passed into the fsyscall handler) | 
 | 106 |  - r32-r39 = system call arguments (as passed into the fsyscall handler) | 
 | 107 |  - b6	  = return address (as passed into the fsyscall handler) | 
 | 108 |  - ar.pfs = previous frame-state (as passed into the fsyscall handler) | 
 | 109 |  | 
 | 110 | Fsyscall handlers can execute with very little overhead, but with that | 
 | 111 | speed comes a set of restrictions: | 
 | 112 |  | 
 | 113 |  o Fsyscall-handlers MUST check for any pending work in the flags | 
 | 114 |    member of the thread-info structure and if any of the | 
 | 115 |    TIF_ALLWORK_MASK flags are set, the handler needs to fall back on | 
 | 116 |    doing a full system call (by calling fsys_fallback_syscall). | 
 | 117 |  | 
 | 118 |  o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, | 
 | 119 |    r15, b6, and ar.pfs) because they will be needed in case of a | 
 | 120 |    system call restart.  Of course, all "preserved" registers also | 
 | 121 |    must be preserved, in accordance with the normal calling conventions. | 
 | 122 |  | 
 | 123 |  o Fsyscall-handlers MUST check whether argument registers contain a | 
 | 124 |    NaT value before using them in any way that could trigger a | 
 | 125 |    NaT-consumption fault.  If a system call argument is found to | 
 | 126 |    contain a NaT value, an fsyscall-handler may return immediately | 
 | 127 |    with r8=EINVAL, r10=-1. | 
 | 128 |  | 
 | 129 |  o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform | 
 | 130 |    any other operation that would trigger mandatory RSE | 
 | 131 |    (register-stack engine) traffic. | 
 | 132 |  | 
 | 133 |  o Fsyscall-handlers MUST NOT write to any stacked registers because | 
 | 134 |    it is not safe to assume that user-level called a handler with the | 
 | 135 |    proper number of arguments. | 
 | 136 |  | 
 | 137 |  o Fsyscall-handlers need to be careful when accessing per-CPU variables: | 
 | 138 |    unless proper safe-guards are taken (e.g., interruptions are avoided), | 
 | 139 |    execution may be pre-empted and resumed on another CPU at any given | 
 | 140 |    time. | 
 | 141 |  | 
 | 142 |  o Fsyscall-handlers must be careful not to leak sensitive kernel | 
 | 143 |    information back to user-level.  In particular, before returning to | 
 | 144 |    user-level, care needs to be taken to clear any scratch registers | 
 | 145 |    that could contain sensitive information (note that the current | 
 | 146 |    task pointer is not considered sensitive: it's already exposed | 
 | 147 |    through ar.k6). | 
 | 148 |  | 
 | 149 |  o Fsyscall-handlers MUST NOT access user-memory without first | 
 | 150 |    validating access-permission (typically done via probe.r.fault | 
 | 151 |    and/or probe.w.fault) and without guarding against memory access | 
 | 152 |    exceptions (this can be done with the EX() macros defined by | 
 | 153 |    asmmacro.h). | 
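
The control flow implied by the pending-work and NaT rules, together
with the error-return convention (r8=EINVAL, r10=-1), can be sketched
in C.  Real fsyscall handlers are written in assembly in fsys.S; this
model only illustrates the order of the checks.  Everything prefixed
with model_ or MODEL_ is hypothetical, the mask value is a placeholder
rather than the kernel's TIF_ALLWORK_MASK, and the NaT bit is modelled
as a plain flag.  The "system call" shown simply returns its argument.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_TIF_ALLWORK_MASK	0xffUL	/* placeholder, not the kernel's value */

enum model_outcome {
	MODEL_DONE,		/* handled entirely in fsys-mode */
	MODEL_FALLBACK		/* must go through fsys_fallback_syscall */
};

struct model_arg {
	long value;
	bool nat;		/* models the NaT bit of an argument register */
};

struct model_result {
	long r8;		/* return value, or the errno code on failure */
	long r10;		/* 0 on success, -1 on failure */
};

enum model_outcome model_fsys_handler(unsigned long thread_flags,
				      struct model_arg arg,
				      struct model_result *res)
{
	/* Pending work (signals, rescheduling, ...) forces a full syscall. */
	if (thread_flags & MODEL_TIF_ALLWORK_MASK)
		return MODEL_FALLBACK;

	/* Test for NaT before the argument is consumed in any way. */
	if (arg.nat) {
		res->r8 = EINVAL;
		res->r10 = -1;
		return MODEL_DONE;
	}

	/* Fast path: the hypothetical call echoes its argument. */
	res->r8 = arg.value;
	res->r10 = 0;
	return MODEL_DONE;
}
```

Checking the work flags first means that a handler never produces a
result which the heavy-weight path would then have to discard.
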
 | 154 |  | 
 | 155 | The above restrictions may seem draconian, but remember that it's | 
 | 156 | possible to trade off some of the restrictions by paying a slightly | 
 | 157 | higher overhead.  For example, if an fsyscall-handler could benefit | 
 | 158 | from the shadow register bank, it could temporarily disable PSR.i and | 
 | 159 | PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as | 
 | 160 | needed.  In other words, following the above rules yields extremely | 
 | 161 | fast system call execution (while fully preserving system call | 
 | 162 | semantics), but there is also a lot of flexibility in handling more | 
 | 163 | complicated cases. | 
 | 164 |  | 
 | 165 | * Signal handling | 
 | 166 |  | 
 | 167 | The delivery of (asynchronous) signals must be delayed until fsys-mode | 
 | 168 | is exited.  This is accomplished with the help of the lower-privilege | 
 | 169 | transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() | 
 | 170 | checks whether the interrupted task was in fsys-mode and, if so, sets | 
 | 171 | PSR.lp and returns immediately.  When fsys-mode is exited via the | 
 | 172 | "br.ret" instruction that lowers the privilege level, a trap will | 
 | 173 | occur.  The trap handler clears PSR.lp again and returns immediately. | 
 | 174 | The kernel exit path then checks for and delivers any pending signals. | 
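
The deferral sequence described above can be modelled as a small state
machine in C.  This is a sketch under stated assumptions: the
model_cpu structure and both functions are hypothetical and merely
mirror the steps in the text; they are not the code in process.c.

```c
#include <stdbool.h>

/* Hypothetical model of the state relevant to signal deferral. */
struct model_cpu {
	bool in_fsys_mode;
	bool psr_lp;		/* models PSR.lp */
	bool signal_pending;
	int signals_delivered;
};

/* Models the notify-resume check made when a task is interrupted. */
void model_notify_resume(struct model_cpu *cpu)
{
	if (!cpu->signal_pending)
		return;
	if (cpu->in_fsys_mode) {
		/* Arm the lower-privilege transfer trap and return at once. */
		cpu->psr_lp = true;
		return;
	}
	cpu->signal_pending = false;
	cpu->signals_delivered++;
}

/* Models the "br.ret" that leaves fsys-mode and lowers the privilege level. */
void model_fsys_return(struct model_cpu *cpu)
{
	cpu->in_fsys_mode = false;
	if (cpu->psr_lp) {
		/* Trap handler clears PSR.lp; the exit path delivers signals. */
		cpu->psr_lp = false;
		model_notify_resume(cpu);
	}
}
```

A signal that arrives while in_fsys_mode is set only arms psr_lp; it is
counted as delivered once model_fsys_return() has run.
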
 | 175 |  | 
 | 176 | * PSR Handling | 
 | 177 |  | 
 | 178 | The "epc" instruction doesn't change the contents of PSR at all.  This | 
 | 179 | is in contrast to a regular interruption, which clears almost all | 
 | 180 | bits.  Because of that, some care needs to be taken to ensure things | 
 | 181 | work as expected.  The following discussion describes how each PSR bit | 
 | 182 | is handled. | 
 | 183 |  | 
 | 184 | PSR.be	Cleared when entering fsys-mode.  A srlz.d instruction is used | 
 | 185 | 	to ensure the CPU is in little-endian mode before the first | 
 | 186 | 	load/store instruction is executed.  PSR.be is normally NOT | 
 | 187 | 	restored upon return from an fsys-mode handler.  In other | 
 | 188 | 	words, user-level code must not rely on PSR.be being preserved | 
 | 189 | 	across a system call. | 
 | 190 | PSR.up	Unchanged. | 
 | 191 | PSR.ac	Unchanged. | 
 | 192 | PSR.mfl Unchanged.  Note: fsys-mode handlers must not write fp-registers! | 
 | 193 | PSR.mfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers! | 
 | 194 | PSR.ic	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed. | 
 | 195 | PSR.i	Unchanged.  Note: fsys-mode handlers can clear the bit, if needed. | 
 | 196 | PSR.pk	Unchanged. | 
 | 197 | PSR.dt	Unchanged. | 
 | 198 | PSR.dfl	Unchanged.  Note: fsys-mode handlers must not write fp-registers! | 
 | 199 | PSR.dfh	Unchanged.  Note: fsys-mode handlers must not write fp-registers! | 
 | 200 | PSR.sp	Unchanged. | 
 | 201 | PSR.pp	Unchanged. | 
 | 202 | PSR.di	Unchanged. | 
 | 203 | PSR.si	Unchanged. | 
 | 204 | PSR.db	Unchanged.  The kernel prevents user-level from setting a hardware | 
 | 205 | 	breakpoint that triggers at any privilege level other than 3 (user-mode). | 
 | 206 | PSR.lp	Unchanged. | 
 | 207 | PSR.tb	Lazy redirect.  If a taken-branch trap occurs while in | 
 | 208 | 	fsys-mode, the trap-handler modifies the saved machine state | 
 | 209 | 	such that execution resumes in the gate page at | 
 | 210 | 	syscall_via_break(), with privilege level 3.  Note: the | 
 | 211 | 	taken-branch trap would occur on the branch invoking the | 
 | 212 | 	fsyscall-handler, at which point, by definition, a syscall | 
 | 213 | 	restart is still safe.  If the system call number is invalid, | 
 | 214 | 	the fsys-mode handler will return directly to user-level.  This | 
 | 215 | 	return will trigger a taken-branch trap, but since the trap is | 
 | 216 | 	taken _after_ restoring the privilege level, the CPU has already | 
 | 217 | 	left fsys-mode, so no special treatment is needed. | 
 | 218 | PSR.rt	Unchanged. | 
 | 219 | PSR.cpl	Cleared to 0. | 
 | 220 | PSR.is	Unchanged (guaranteed to be 0 on entry to the gate page). | 
 | 221 | PSR.mc	Unchanged. | 
 | 222 | PSR.it	Unchanged (guaranteed to be 1). | 
 | 223 | PSR.id	Unchanged.  Note: the ia64 linux kernel never sets this bit. | 
 | 224 | PSR.da	Unchanged.  Note: the ia64 linux kernel never sets this bit. | 
 | 225 | PSR.dd	Unchanged.  Note: the ia64 linux kernel never sets this bit. | 
 | 226 | PSR.ss	Lazy redirect.  If set, "epc" will cause a Single Step Trap to | 
 | 227 | 	be taken.  The trap handler then modifies the saved machine | 
 | 228 | 	state such that execution resumes in the gate page at | 
 | 229 | 	syscall_via_break(), with privilege level 3. | 
 | 230 | PSR.ri	Unchanged. | 
 | 231 | PSR.ed	Unchanged.  Note: This bit could only have an effect if an fsys-mode | 
 | 232 | 	handler performed a speculative load that gets NaTted.  If so, this | 
 | 233 | 	would be the normal & expected behavior, so no special treatment is | 
 | 234 | 	needed. | 
 | 235 | PSR.bn	Unchanged.  Note: fsys-mode handlers may clear the bit, if needed. | 
 | 236 | 	Doing so requires clearing PSR.i and PSR.ic as well. | 
 | 237 | PSR.ia	Unchanged.  Note: the ia64 linux kernel never sets this bit. | 
 | 238 |  | 
 | 239 | * Using fast system calls | 
 | 240 |  | 
 | 241 | To use fast system calls, userspace applications simply need to call | 
 | 242 | __kernel_syscall_via_epc().  For example: | 
 | 243 |  | 
 | 244 | -- example fgettimeofday() call -- | 
 | 245 | -- fgettimeofday.S -- | 
 | 246 |  | 
 | 247 | #include <asm/asmmacro.h> | 
 | 248 |  | 
 | 249 | GLOBAL_ENTRY(fgettimeofday) | 
 | 250 | .prologue | 
 | 251 | .save ar.pfs, r11 | 
 | 252 | mov r11 = ar.pfs | 
 | 253 | .body  | 
 | 254 |  | 
 | 255 | movl r2 = 0xa000000000020660;;  // gate address  | 
 | 256 | 			       // found by inspection of System.map for the  | 
 | 257 | 			       // __kernel_syscall_via_epc() function.  See | 
 | 258 | 			       // below for how to do this for real. | 
 | 259 |  | 
 | 260 | mov b7 = r2 | 
 | 261 | mov r15 = 1087		       // gettimeofday syscall | 
 | 262 | ;; | 
 | 263 | br.call.sptk.many b6 = b7 | 
 | 264 | ;; | 
 | 265 |  | 
 | 266 | .restore sp | 
 | 267 |  | 
 | 268 | mov ar.pfs = r11 | 
 | 269 | br.ret.sptk.many rp;;	      // return to caller | 
 | 270 | END(fgettimeofday) | 
 | 271 |  | 
 | 272 | -- end fgettimeofday.S -- | 
 | 273 |  | 
 | 274 | In reality, the gate address is obtained from two extra values | 
 | 275 | passed via the ELF auxiliary vector (include/asm-ia64/elf.h): | 
 | 276 |  | 
 | 277 |  o AT_SYSINFO: the address of __kernel_syscall_via_epc() | 
 | 278 |  o AT_SYSINFO_EHDR: the address of the kernel gate ELF DSO | 
 | 279 |  | 
 | 280 | The ELF DSO is a pre-linked library that is mapped in by the kernel at | 
 | 281 | the gate page.  It is a proper ELF shared object so, with a dynamic | 
 | 282 | loader that recognises the library, you should be able to make calls to | 
 | 283 | the exported functions within it as with any other shared library. | 
 | 284 | AT_SYSINFO points into the kernel DSO at the | 
 | 285 | __kernel_syscall_via_epc() function for historical reasons (it was | 
 | 286 | used before the kernel DSO) and as a convenience. |
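
As a sketch of how user-level code might read these two values, the
following C fragment queries the auxiliary vector with getauxval().
Note the assumptions: getauxval() is a glibc interface (declared in
<sys/auxv.h>) that is not mentioned in this document, and the
gate_info structure and lookup_gate() function are hypothetical names
used only for illustration.

```c
#include <elf.h>		/* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>		/* getauxval() */

/* Hypothetical container for the two auxiliary-vector values. */
struct gate_info {
	unsigned long syscall_entry;	/* AT_SYSINFO value, 0 if absent */
	unsigned long dso_base;		/* AT_SYSINFO_EHDR value, 0 if absent */
};

/* Read both entries; getauxval() returns 0 for an entry that is missing. */
struct gate_info lookup_gate(void)
{
	struct gate_info info;

	info.syscall_entry = getauxval(AT_SYSINFO);
	info.dso_base = getauxval(AT_SYSINFO_EHDR);
	return info;
}
```

A caller should treat a zero syscall_entry as "no fast system call
entry point available" and fall back to the regular system call path.
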