| Michał Mirosław | e5b1de1 | 2011-07-12 22:27:00 -0700 | 1 | Netdev features mess and how to get out from it alive | 
|  | 2 | ===================================================== | 
|  | 3 |  | 
|  | 4 | Author: | 
|  | 5 | Michał Mirosław <mirq-linux@rere.qmqm.pl> | 
|  | 6 |  | 
|  | 7 |  | 
|  | 8 |  | 
|  | 9 | Part I: Feature sets | 
|  | 10 | ====================== | 
|  | 11 |  | 
|  | 12 | Long gone are the days when a network card would just take and give packets | 
|  | 13 | verbatim.  Today's devices add multiple features and bugs (read: offloads) | 
|  | 14 | that relieve an OS of various tasks like generating and checking checksums, | 
|  | 15 | splitting packets, and classifying them.  Those capabilities and their state | 
|  | 16 | are commonly referred to as netdev features in the Linux kernel world. | 
|  | 17 |  | 
|  | 18 | There are currently three sets of features relevant to the driver, and | 
|  | 19 | one used internally by the network core: | 
|  | 20 |  | 
|  | 21 | 1. netdev->hw_features set contains features whose state may be changed | 
|  | 22 | (enabled or disabled) for a particular device at the user's request. | 
|  | 23 | This set should be initialized in the ndo_init callback and not | 
|  | 24 | changed later. | 
|  | 25 |  | 
|  | 26 | 2. netdev->features set contains features which are currently enabled | 
|  | 27 | for a device.  This should be changed only by the network core or in | 
|  | 28 | the error paths of the ndo_set_features callback. | 
|  | 29 |  | 
|  | 30 | 3. netdev->vlan_features set contains features whose state is inherited | 
|  | 31 | by child VLAN devices (it limits their netdev->features set).  This is | 
|  | 32 | currently used for all VLAN devices, whether tags are stripped or inserted | 
|  | 33 | in hardware or software. | 
|  | 34 |  | 
|  | 35 | 4. netdev->wanted_features set contains the feature set requested by the user. | 
|  | 36 | This set is filtered by the ndo_fix_features callback whenever it or | 
|  | 37 | some device-specific conditions change.  This set is internal to the | 
|  | 38 | networking core and should not be referenced in drivers. | 
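
To make the relationship between the four sets concrete, here is a minimal
userspace sketch.  It is not kernel code: features_t, struct model_dev, the
F_* bits and the model_* helpers are all hypothetical stand-ins, and the
VLAN bound shown (enabled set intersected with vlan_features) is a
simplifying assumption about how the limit is applied.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t features_t;	/* stand-in for the kernel's feature bitmask */

/* Hypothetical feature bits, for illustration only. */
#define F_SG	(1ULL << 0)
#define F_CSUM	(1ULL << 1)
#define F_TSO	(1ULL << 2)

/* Simplified model of the four sets; not the real struct net_device. */
struct model_dev {
	features_t hw_features;		/* may be toggled by the user */
	features_t features;		/* currently enabled */
	features_t vlan_features;	/* inheritable by VLAN children */
	features_t wanted_features;	/* requested by the user, core-internal */
};

/* Record a user request; bits outside hw_features are ignored. */
void model_request(struct model_dev *dev, features_t mask, features_t value)
{
	mask &= dev->hw_features;
	dev->wanted_features = (dev->wanted_features & ~mask) | (value & mask);
}

/* Assumed upper bound on the features a VLAN child could enable. */
features_t model_vlan_limit(const struct model_dev *dev)
{
	return dev->features & dev->vlan_features;
}

/* Usage: F_CSUM is not in hw_features, so a request to change it is dropped. */
void model_sets_example(void)
{
	struct model_dev dev = {
		.hw_features = F_SG | F_TSO,
		.features = F_SG | F_CSUM,
		.vlan_features = F_SG,
		.wanted_features = F_SG,
	};

	model_request(&dev, F_TSO | F_CSUM, F_TSO | F_CSUM);
	assert(dev.wanted_features == (F_SG | F_TSO));
	assert(model_vlan_limit(&dev) == F_SG);
}
```

Note that a feature can be enabled without being user-changeable: F_CSUM is
in features but not in hw_features, so it stays on whatever the user asks.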
|  | 39 |  | 
|  | 40 |  | 
|  | 41 |  | 
|  | 42 | Part II: Controlling enabled features | 
|  | 43 | ======================================= | 
|  | 44 |  | 
|  | 45 | When the current feature set (netdev->features) is to be changed, a new set | 
|  | 46 | is calculated and filtered by calling the ndo_fix_features callback | 
|  | 47 | and netdev_fix_features().  If the resulting set differs from the current | 
|  | 48 | set, it is passed to the ndo_set_features callback and (if the callback | 
|  | 49 | returns success) replaces the value stored in netdev->features. | 
|  | 50 | A NETDEV_FEAT_CHANGE notification is issued after that whenever the current | 
|  | 51 | set might have changed. | 
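
The sequence above can be sketched as a small userspace model.  This is an
illustration under stated assumptions, not the core's implementation:
struct model_dev, struct model_ops, the F_* bits and every function name
are hypothetical, the netdev_fix_features() step is only marked by a
comment, and the notification is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t features_t;

#define F_SG	(1ULL << 0)	/* hypothetical feature bits */
#define F_TSO	(1ULL << 1)

struct model_dev;

struct model_ops {
	features_t (*fix_features)(struct model_dev *dev, features_t features);
	int (*set_features)(struct model_dev *dev, features_t features);
};

struct model_dev {
	features_t features;		/* currently enabled set */
	features_t wanted;		/* set requested by the user */
	const struct model_ops *ops;
	int notifications;		/* simulated NETDEV_FEAT_CHANGE count */
};

/* Recalculate the enabled set; returns 1 if it might have changed. */
int model_update_features(struct model_dev *dev)
{
	features_t features = dev->wanted;
	int err = 0;

	/* Missing callbacks are treated as always succeeding. */
	if (dev->ops->fix_features)
		features = dev->ops->fix_features(dev, features);
	/* Core-imposed limits (netdev_fix_features()) would apply here. */

	if (features == dev->features)
		return 0;

	if (dev->ops->set_features)
		err = dev->ops->set_features(dev, features);
	if (err == 0)
		dev->features = features;

	dev->notifications++;
	return 1;
}

/* Example driver callback: this model device needs SG for TSO. */
features_t example_fix(struct model_dev *dev, features_t features)
{
	(void)dev;
	if (!(features & F_SG))
		features &= ~F_TSO;
	return features;
}

const struct model_ops example_ops = { .fix_features = example_fix };

void model_update_example(void)
{
	struct model_dev dev = { .wanted = F_TSO, .ops = &example_ops };

	assert(model_update_features(&dev) == 0);	/* TSO filtered out */
	dev.wanted = F_SG | F_TSO;
	assert(model_update_features(&dev) == 1);
	assert(dev.features == (F_SG | F_TSO));
	assert(dev.notifications == 1);
}
```

In this model, when the filtered set equals the current one, set_features is
never called and no notification is counted.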
|  | 52 |  | 
|  | 53 | The following events trigger recalculation: | 
|  | 54 | 1. device registration, after ndo_init has returned success | 
|  | 55 | 2. a user-requested change in feature state | 
|  | 56 | 3. a call to netdev_update_features() | 
|  | 57 |  | 
|  | 58 | ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks | 
|  | 59 | are treated as always returning success. | 
|  | 60 |  | 
|  | 61 | A driver that wants to trigger recalculation must do so by calling | 
|  | 62 | netdev_update_features() while holding rtnl_lock.  This should not be done | 
|  | 63 | from ndo_*_features callbacks.  netdev->features should not be modified by | 
|  | 64 | the driver except by means of the ndo_fix_features callback. | 
|  | 65 |  | 
|  | 66 |  | 
|  | 67 |  | 
|  | 68 | Part III: Implementation hints | 
|  | 69 | ================================ | 
|  | 70 |  | 
|  | 71 | * ndo_fix_features: | 
|  | 72 |  | 
|  | 73 | All dependencies between features should be resolved here. The resulting | 
|  | 74 | set can be reduced further by networking core imposed limitations (as coded | 
|  | 75 | in netdev_fix_features()). For this reason it is safer to disable a feature | 
|  | 76 | when its dependencies are not met instead of forcing the dependency on. | 
|  | 77 |  | 
|  | 78 | This callback should modify neither hardware nor driver state (it should | 
|  | 79 | be stateless).  It can be called multiple times between successive | 
|  | 80 | ndo_set_features calls. | 
|  | 81 |  | 
|  | 82 | The callback must not alter features contained in NETIF_F_SOFT_FEATURES or | 
|  | 83 | NETIF_F_NEVER_CHANGE sets.  The exception is NETIF_F_VLAN_CHALLENGED, but | 
|  | 84 | care must be taken as the change won't affect already configured VLANs. | 
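
As an illustration of these hints, the sketch below shows a stateless filter
that disables a feature whose dependency is missing instead of forcing the
dependency on.  It is a userspace model only: the F_* bits, struct
model_dev, MODEL_TSO_MAX_MTU and the two dependency rules (LRO needs RX
checksumming, TSO needs a standard MTU) are assumptions invented for this
example and do not describe any particular device.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t features_t;

#define F_RXCSUM	(1ULL << 0)	/* hypothetical feature bits */
#define F_LRO		(1ULL << 1)
#define F_TSO		(1ULL << 2)

#define MODEL_TSO_MAX_MTU 1500		/* hypothetical hardware limit */

struct model_dev {
	unsigned int mtu;
};

/* Stateless filter: reads device state but writes nothing. */
features_t model_fix_features(const struct model_dev *dev, features_t features)
{
	/* LRO depends on RX checksumming: drop LRO, do not force RXCSUM on. */
	if (!(features & F_RXCSUM))
		features &= ~F_LRO;

	/* This model device cannot segment when the MTU is too large. */
	if (dev->mtu > MODEL_TSO_MAX_MTU)
		features &= ~F_TSO;

	return features;
}

void model_fix_example(void)
{
	struct model_dev dev = { .mtu = 9000 };
	features_t out = model_fix_features(&dev, F_LRO | F_TSO);

	assert(out == 0);
	/* Calling it again on its own output changes nothing. */
	assert(model_fix_features(&dev, out) == out);
}
```

Because the function only clears bits and keeps no state, it can safely be
called any number of times before the next ndo_set_features call.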
|  | 85 |  | 
|  | 86 | * ndo_set_features: | 
|  | 87 |  | 
|  | 88 | Hardware should be reconfigured to match the passed feature set.  The set | 
|  | 89 | should not be altered unless some error condition happens that can't | 
|  | 90 | be reliably detected in ndo_fix_features.  In this case, the callback | 
|  | 91 | should update netdev->features to match the resulting hardware state. | 
|  | 92 | Errors returned are not (and cannot be) propagated anywhere except dmesg. | 
|  | 93 | (Note: a successful return is zero; >0 means a silent error.) | 
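
The error path described above might look like the following userspace
sketch.  Everything here is hypothetical: the simulated hardware, the
hw_faulty mask that makes some bits fail, and the choice of -EIO as the
error code are assumptions made for illustration.  Returning a positive
value instead would, per the note above, report the error silently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint64_t features_t;

#define F_RXCSUM	(1ULL << 0)	/* hypothetical feature bits */
#define F_VLAN_RX	(1ULL << 1)

struct model_dev {
	features_t features;	/* what the core believes is enabled */
	features_t hw_state;	/* what the simulated hardware has enabled */
	features_t hw_faulty;	/* bits the simulated hardware fails to change */
};

/* Pretend register write: faulty bits keep their old state. */
static features_t model_hw_write(struct model_dev *dev, features_t features)
{
	dev->hw_state = (features & ~dev->hw_faulty) |
			(dev->hw_state & dev->hw_faulty);
	return dev->hw_state;
}

int model_set_features(struct model_dev *dev, features_t features)
{
	features_t result = model_hw_write(dev, features);

	if (result != features) {
		/* Error path: record what the hardware really ended up with. */
		dev->features = result;
		return -EIO;
	}
	return 0;	/* on success the caller stores the new set */
}

void model_set_example(void)
{
	struct model_dev dev = { .hw_faulty = F_VLAN_RX };

	assert(model_set_features(&dev, F_RXCSUM) == 0);
	dev.features = F_RXCSUM;	/* done by the core on success */
	assert(model_set_features(&dev, F_RXCSUM | F_VLAN_RX) == -EIO);
	assert(dev.features == F_RXCSUM);
}
```

On success the callback leaves dev->features alone; only the failure branch
writes it, so that the stored set keeps matching the hardware.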
|  | 94 |  | 
|  | 95 |  | 
|  | 96 |  | 
|  | 97 | Part IV: Features | 
|  | 98 | =================== | 
|  | 99 |  | 
|  | 100 | For current list of features, see include/linux/netdev_features.h. | 
|  | 101 | This section describes semantics of some of them. | 
|  | 102 |  | 
|  | 103 | * Transmit checksumming | 
|  | 104 |  | 
|  | 105 | For complete description, see comments near the top of include/linux/skbuff.h. | 
|  | 106 |  | 
|  | 107 | Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. | 
|  | 108 | It means that the device can fill a TCP/UDP-like checksum anywhere in the | 
|  | 109 | packet, whatever headers there might be. | 
|  | 110 |  | 
|  | 111 | * Transmit TCP segmentation offload | 
|  | 112 |  | 
|  | 113 | NETIF_F_TSO_ECN means that hardware can properly split packets with the CWR | 
|  | 114 | bit set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). | 
|  | 115 |  | 
|  | 116 | * Transmit DMA from high memory | 
|  | 117 |  | 
|  | 118 | On platforms where this is relevant, NETIF_F_HIGHDMA signals that | 
|  | 119 | ndo_start_xmit can handle skbs with frags in high memory. | 
|  | 120 |  | 
|  | 121 | * Transmit scatter-gather | 
|  | 122 |  | 
|  | 123 | These features indicate that ndo_start_xmit can handle fragmented skbs: | 
|  | 124 | NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- | 
|  | 125 | chained skbs (skb->next/prev list). | 
|  | 126 |  | 
|  | 127 | * Software features | 
|  | 128 |  | 
|  | 129 | Features contained in NETIF_F_SOFT_FEATURES are features of the networking | 
|  | 130 | stack.  Drivers should not change behaviour based on them. | 
|  | 131 |  | 
|  | 132 | * LLTX driver (deprecated for hardware drivers) | 
|  | 133 |  | 
|  | 134 | NETIF_F_LLTX should be set in drivers that implement their own locking in | 
|  | 135 | the transmit path or don't need locking at all (e.g. software tunnels). | 
|  | 136 | In ndo_start_xmit, it is recommended to use a try_lock and return | 
|  | 137 | NETDEV_TX_LOCKED when the lock cannot be taken.  The locking should also | 
|  | 138 | protect against other callbacks (you will need to work out the rules yourself). | 
|  | 139 |  | 
|  | 140 | Don't use it for new drivers. | 
|  | 141 |  | 
|  | 142 | * netns-local device | 
|  | 143 |  | 
|  | 144 | NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between | 
|  | 145 | network namespaces (e.g. loopback). | 
|  | 146 |  | 
|  | 147 | Don't use it in drivers. | 
|  | 148 |  | 
|  | 149 | * VLAN challenged | 
|  | 150 |  | 
|  | 151 | NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN | 
|  | 152 | headers. Some drivers set this because the cards can't handle the bigger MTU. | 
|  | 153 | [FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU | 
|  | 154 | VLANs. This may not be useful, though.] |