This is my second post on performance tuning the OSX network stack. My previous post on this topic has gotten over 55,000 views and generated quite a bit of feedback and questions. This updated post is intended to address performance tuning the IP stack on OSX Mavericks and, hopefully, eliminate some of the confusion around many of the caveats and options, as this is a fairly complex topic. Much of this configuration and the settings are backwards compatible to previous OSX releases. However, please refer to my original post for the gory details and caveats related to previous OSX releases.
MBUF Memory Allocation
Probably the most fundamental yet debatable system option for network tuning is memory allocation to the mbuf buffer pools. This boot-time parameter governs the scalability of network connections, in terms of the simultaneous number supported and how much data can be queued in either direction. This is the basis for determining most of the buffer and socket threshold values which can impact network throughput and connection scalability. The default value seems to be set based on the amount of physical RAM your system has installed. Based on my research, BSD, the underlying OS that OS X is built on top of, defaults to assigning 64MB of memory to the mbuf pool. This memory must be fairly allocated with per socket/connection thresholds to reasonably protect you from starving the entire system with just a handful of network connections. Thus, you want to set a limit as to how much memory a single connection can consume, balanced based on peak profile usage. I have not been able to identify if OSX detects the amount of system RAM upon install and sets the built-in boot argument as part of the install script or if it is just arbitrarily defined for a particular OSX release. Anyone that has this information, feel free to share it here in the comments. I’d appreciate it.
OSX Mavericks 10.9 running on a brand new Macbook with 16Gig of RAM appears to have been configured to allocate 128MB of memory to the mbuf pool. That is double the value I see set on my Macbook running Mountain Lion 10.8 with 8Gig of RAM. I have not found a way to reveal the actual allocation value. But, what I do know is the standard way the system calculates the buffer settings and can reverse those values to arrive at the number.
There is a system setting called kern.ipc.maxsockbuf. This value sets the maximum amount of Bytes of memory that can be allocated to a single socket. The standard calculation sets this value to 1/16th of the total amount of memory allocated to the mbuf pool. In the case of my Macbook running Mavericks 10.9, that means I could have a total of 16 sockets open each with a maximum of 8MB of buffer memory assigned. That is a considerable amount of memory per socket. Typical application connections on a user machine, under normal network conditions, won’t normally use more than 128 to 256KB of buffer memory at peak utilization. So by allocating more memory to the mbuf pool, you allow the system the ability to manage more simultaneous connections with the ability to dynamically expand their buffers to improve performance as needed. This is usually more of an issue with servers because they must be built to support large numbers of simultaneous connections from many application users and potentially a mix of connections that require considerably high throughput, depending on the applications in use.
My recommendation is to let the system select the defaults for kern.ipc.maxsockbuf and kern.ipc.nmbclusters. The more physical memory your system has, the more connections and throughput per connection you will be able to obtain. If you decide you really need to raise these limits because you like to run multiple simultaneous backup jobs across your network interface or you are hosting a busy webserver or you are continuously running Torrents or you have an unhealthy obsession with seeing how far you can push your system just for fun, then you will need to add an NVRAM boot argument to do so. I would first verify if you are even running into memory exhaustion.
Use the command ‘netstat -m’ from a terminal prompt. The most important metrics are
0 requests for memory denied
0 requests for memory delayed
If you are not seeing any hits in these counters, you are not taxing your buffer memory. Otherwise, increasing your mbuf pool will provide some performance improvement or ability to support more simultaneous connections.
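As a rough sketch, the two counters can be pulled out of the ‘netstat -m’ output with a small shell function. The name mbuf_pressure is made up for illustration, and the function assumes the counter lines are worded exactly as shown above.

```shell
# Hypothetical helper (not part of OSX): sums the "denied" and "delayed"
# counters from 'netstat -m' output read on stdin.
# Assumes lines of the form "0 requests for memory denied".
mbuf_pressure() {
    awk '/requests for memory (denied|delayed)/ { total += $1 }
         END { print total + 0 }'
}

# Usage on a Mac (prints 0 when the mbuf pool is not under pressure):
#   netstat -m | mbuf_pressure
```

Any result other than 0 suggests the mbuf pool has been exhausted at some point since boot.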
If you really want more memory allocated you can update your NVRAM settings with the following command:
sudo nvram boot-args="ncl=131072"
This would allocate 256MB of memory to the mbuf pool which is the recommendation for a 10Gig interface.
For a system with 16Gig of RAM, the ncl or mbuf value appears to be set to 65536. This equals 65536 2KB clusters which equates to 131072KB or 128MB of RAM. This should be more than enough for 1Gig interfaces and below. 1/16th of that pool, which is equivalent to the kern.ipc.maxsockbuf value, is 8MB or 8388608 Bytes. You can confirm your own system setup by running the command ‘sudo sysctl -a | grep kern.ipc.maxsockbuf’. Multiply that value by 16 and then divide by 1024 twice to arrive at the number of MegaBytes of RAM allocated to your ncl or mbuf pool. The above NVRAM boot-arg, which is recommended to be a binary multiple to align evenly with memory cluster allocation, will assign 131072 2KB clusters to the mbuf pool. That is 262144KB = 256MB of RAM. It is up to you how much RAM you want to dedicate to your network buffers. The defaults will work just fine for 99% of the users out there on 1 Gig interfaces or lower. If you do need to tune, I would stress test your typical usage behavior, monitor the output of ‘netstat -m’ and see if you are hitting denied memory requests. If so, then try adjusting your mbuf boot-arg to double whatever you calculate your default mbuf value to be. Use the ‘sudo nvram boot-args="ncl=mbufvalue"’ command (replacing mbufvalue with the number of 2KB clusters you calculate), reboot the system and try again.
For users with 10Gig interfaces the recommendation is to assign at least 256MB of memory to the mbuf pools. This should result in a kern.ipc.maxsockbuf value of 16777216.
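The conversions above can be sketched as a few shell functions. The function names are hypothetical, and the 1/16th relationship between kern.ipc.maxsockbuf and the pool size is the assumption described earlier, not something confirmed by Apple documentation.

```shell
# Hypothetical helpers for the mbuf pool arithmetic described above.

# maxsockbuf (bytes) is assumed to be 1/16th of the pool, so
# pool MB = maxsockbuf * 16 / 1024 / 1024.
mbuf_pool_mb_from_maxsockbuf() {
    echo $(( $1 * 16 / 1024 / 1024 ))
}

# Each cluster is 2KB, so pool MB = ncl * 2 / 1024.
mbuf_pool_mb_from_ncl() {
    echo $(( $1 * 2 / 1024 ))
}

# ncl value (number of 2KB clusters) needed for a pool of the given MB.
ncl_for_pool_mb() {
    echo $(( $1 * 1024 / 2 ))
}

# Usage on a Mac:
#   mbuf_pool_mb_from_maxsockbuf "$(sysctl -n kern.ipc.maxsockbuf)"
```

With the numbers from this post, a maxsockbuf of 8388608 works out to a 128MB pool, and a 256MB pool needs an ncl value of 131072.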
Base Config Recommendations
Now that I’ve pretty much made everyone’s eyes glaze over on that one fundamental setting, I’ll get on with the bulk of the configuration details. I am still researching the details on the TCP autotuning features that are now included in OSX Mavericks 10.9. At first glance, the relevant parameters appear to be automatically derived from the sendspace and recvspace values that are manually set below. I have not been able to map the specific labeled parameters on OSX Mavericks back to the BSD standard settings, yet. It is a good possibility there is nothing to be done with these settings.
For reference, here are the custom settings I have added to my own sysctl.conf file:
kern.ipc.somaxconn=2048
net.inet.tcp.rfc1323=1
net.inet.tcp.win_scale_factor=4
net.inet.tcp.sendspace=1042560
net.inet.tcp.recvspace=1042560
net.inet.tcp.mssdflt=1448
net.inet.tcp.v6mssdflt=1412
net.inet.tcp.msl=15000
net.inet.tcp.always_keepalive=0
net.inet.tcp.delayed_ack=3
net.inet.tcp.slowstart_flightsize=20
net.inet.tcp.local_slowstart_flightsize=9
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.icmp.icmplim=50
The easiest way to edit this file is to open a Terminal window and execute ‘sudo nano /etc/sysctl.conf’. The sudo command allows you to elevate your rights to admin. You will be prompted to enter your password if you have admin rights. nano is the name of the command line text editor program. The above entries just get added to this file one line at a time.
You can also update your running settings without rebooting by using the ‘sudo sysctl -w’ command. Just append each of the above settings one at a time after this command. kern.ipc.maxsockets and kern.ipc.nmbclusters can only be modified from the sysctl.conf file upon reboot.
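If you would rather not type each setting by hand, here is a minimal sketch that turns sysctl.conf-style lines into the matching ‘sudo sysctl -w’ commands. The function name sysctl_commands is hypothetical. It only prints the commands, so you can review them before running anything.

```shell
# Hypothetical helper: reads "name=value" lines on stdin and prints the
# matching 'sudo sysctl -w' command for each one.
# It only prints; nothing changes until you run the output yourself.
sysctl_commands() {
    while IFS= read -r line; do
        case "$line" in
            ''|'#'*) continue ;;                  # skip blanks and comments
            *=*) echo "sudo sysctl -w $line" ;;   # one command per setting
        esac
    done
}

# Usage on a Mac:
#   sysctl_commands < /etc/sysctl.conf
```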
Explanation of Configuration Options
Following you will find my explanations about each of the parameters I have customized or included in my sysctl.conf file:
You’ll note that I have removed the system parameters to adjust the socket buffers and nmbclusters. The defaults on a system with 16Gig of RAM are more than enough for the stuff that I do. Read above on the explanation to tune these using the actual memory allocation to the mbuf pool. Changing that one setting will automatically adjust all of the other settings accordingly without having to rely on your own math skills.
The net.inet.tcp.sockthreshold setting has also been deprecated in OSX 10.9 Mavericks. I have been looking for any documentation relating to this change and have not yet found anything useful. This setting used to be used as a sort of grooming mechanism to trigger an arbitrary socket count threshold at which the system would start obeying the values you set in the net.inet.tcp.recvspace and net.inet.tcp.sendspace parameters. With the absence of this parameter, the question is if it is no longer adjustable or if this feature has been completely removed. Anyone with info on this change, please feel free to share here.
- kern.ipc.somaxconn limits the maximum number of pending connections that can be queued on a listening socket (the listen backlog). The default here is just 128. If an attacker can flood you with a sufficiently high number of SYN packets in a short enough period of time, the queue fills up and new connection attempts are refused, thus successfully denying your users access to the service. Increasing this value is also beneficial if you run any automated programs like P2P clients that can drain your connection pool real quickly.
- I have hard-coded the enabling of RFC1323 (net.inet.tcp.rfc1323) which is the TCP High Performance Options (Window Scaling). This should be on by default on OSX Mavericks 10.9. It has been a default setting since at least Snow Leopard OSX 10.6, if not even earlier on Leopard OSX 10.5. It should be noted that this setting also enables TCP Timestamps by default. This adds an additional 12 bytes to the TCP header, thus reducing the MSS from 1460 to 1448 bytes. The Window scale factor, at this point, is arbitrarily set to a default of 3. I have intentionally hard-coded the Window Scaling factor to 3 because it matches my need to fill up my particular Internet connection. Ensuring this value is set to 3 allows me the ability to fully utilize my 45Meg AT&T U-verse connection. I calculated this based on my Internet connection’s bandwidth-delay product (BDP). On average I should be able to achieve 45Mbps or 45 x 10^6 bits per second. My average maximum roundtrip latency is somewhere around 50 milliseconds or 0.05 seconds. 45 x 10^6 x 0.05 = 2,250,000 bits. So, my Internet connection can nominally sustain approximately 2,250,000 / 8 bits per byte = 281,250 Bytes of data in-transit on the network from Point A to Point Z. If my aim is to fully utilize my Internet bandwidth, setting a Window value that allows roughly double that amount in either direction would be recommended. The TCP window field in the TCP header is 16 bits wide, yielding a maximum possible value of 65535 Bytes. A window scaling factor of 3, which is the same as saying a multiplier of 2^3 = 8, covers almost twice my BDP. If the TCP window is set to 65535 with a window scale factor of 3, I would be able to transmit 2^3 x 65,535 Bytes = 524,280 Bytes on the network before requiring an ACK packet. So, a value of 3 for the Window Scale Factor setting should be more than adequate for the vast majority of individuals’ Internet connections at 100Mbps or less.
Once you get beyond 100Mbps with an average peak latency around 50 milliseconds, you might want to consider bumping the Window Scale Factor up to 4, as I have done, since I have a Gig connection on my local network. You would also want to recalculate your BDP and determine the appropriate window size based on a Window Scale Factor of 4 or 2^4 = 16.
If you notice unacceptably poor performance with key applications you use, I would suggest you try disabling the RFC1323 option altogether and make sure your net.inet.tcp.sendspace and net.inet.tcp.recvspace values are set to exactly 65535. Any applications that have load balanced servers with Window scaling enabled and are using a Layer 5 type load balancing ruleset (e.g. load balance based on URL or object type) can exhibit severe throughput problems if the explicit window scaling configuration has not been properly addressed on the Load Balancer. This is another fairly complex issue with how load balancers manage TCP connections with Layer 5 rules and how Window Scaling is negotiated during the TCP setup handshake. When not configured properly, you will end up with the 2 endpoints in a transaction that do not use the same Window Scaling factor, or one end is Window Scaling and the other is not.
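For anyone who wants to redo the BDP arithmetic for their own connection, here is a minimal sketch in shell. The function names are hypothetical, and bandwidth is passed in bits per second with latency in whole milliseconds.

```shell
# Hypothetical helpers for the bandwidth-delay product (BDP) arithmetic.

# bdp_bytes BITS_PER_SEC RTT_MS
# BDP in bytes = bits per second * round trip in ms / 1000 / 8.
bdp_bytes() {
    echo $(( $1 * $2 / 1000 / 8 ))
}

# min_window_scale BYTES
# Smallest window scale factor (0 to 14, the RFC1323 maximum) whose largest
# window, 65535 * 2^factor, is at least the requested number of bytes.
min_window_scale() (
    factor=0
    while [ "$factor" -lt 14 ] && [ $(( 65535 << factor )) -lt "$1" ]; do
        factor=$(( factor + 1 ))
    done
    echo "$factor"
)

# Example: 45Mbps with a 50ms round trip
#   bdp_bytes 45000000 50      # 281250
#   min_window_scale 281250    # 3
```

Pass min_window_scale whatever target you settle on, for example a multiple of the BDP if you want headroom.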
- The net.inet.tcp.sendspace and net.inet.tcp.recvspace settings control the maximum TCP window size the system will allow sending from the machine or receiving to the machine. Up until the latest releases of most operating systems, these values defaulted to 65535 Bytes. This has been the de facto standard essentially from the beginning of the TCP protocol’s existence in the 1970’s. Now that the RFC1323 High Performance TCP Options are starting to be more widely accepted and configured, these values can be increased to improve performance. I have set mine both to 1042560 bytes. That is almost a factor of 16 times the old default 65535 limit. I arrived at this value using the following calculation: MSS x (16bit Window/MSS, rounded down) x 2^(Window Scale Factor) = 1448 x 45 x 2^4 = 1042560. If I wanted to keep it an even multiple of the maximum TCP Window field width of 16bits or 65535, I could round up to 16 x 65535 = 1048560 Bytes. There is no hard and fast rule on that. Generally, the most optimal setting would be to use the calculated value based on your own network’s BDP. You may want to factor in the worst case scenario between your Internet connection and your local LAN connection. In my case, I have opted to use the numbers for my local Gig connection. TCP autotuning should take care of my Internet connection. In the case of my Internet connection, if I doubled my current 45Mbps bandwidth and the average latency factor stayed the same, I would want to double my TCP window size to be able to utilize my full bandwidth.
Following is how I arrived at these numbers:
- The MSS I am using is 1448 because I have RFC1323 enabled which enables TCP Timestamps and reduces the default MSS of 1460 bytes by 12 bytes to 1448 bytes.
- 2^4 = 16 matches the Window Scaling Factor of 4 I have chosen to configure.
- The value of 45 is a little bit more convoluted to figure out. This number is a multiple of the MSS that is less than or equal to the max TCP Window 16bit field value of 65535 bytes. So, 1448 x 45 = 65160. If you were using an MSS of 1460, this value would be set to 44. But, in the case of OSX, since TCP Timestamps are automatically enabled when you enable RFC1323, you shouldn’t set the MSS higher than 1448. It might be less if you have additional overhead on your line such as PPPoE on a DSL line etc.
You must have the RFC1323 options enabled, in order to set these values above 65535.
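The sendspace/recvspace calculation above can be written as a one-line shell function. This is only a sketch of the arithmetic, and the name tcp_space is made up for illustration.

```shell
# Hypothetical helper: tcp_space MSS WINDOW_SCALE_FACTOR
# Buffer size = MSS * (whole segments that fit in the 16 bit window of
# 65535 bytes) * 2^(window scale factor).
tcp_space() {
    echo $(( $1 * (65535 / $1) * (1 << $2) ))
}

# Examples:
#   tcp_space 1448 4    # 1042560 (the value used in this post)
#   tcp_space 1460 4    # 1027840 (without TCP Timestamps)
```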
- The net.inet.tcp.mssdflt setting seems simple to configure on the surface. However, arriving at the optimum setting for your particular network setup and requirements can be a mathematical exercise that is not straightforward. The default MSS value that Apple has configured is a measly 512 bytes. That setting value is really targeted to be optimal for dial-up users or users with fairly slow broadband connections ~3Mbps and below. The impact is not really noticeable on a high speed LAN segment. But it can be a performance bottleneck across a typical residential broadband connection with higher latency. This setting adjusts the Maximum Segment Size that your system can transmit. You need to understand the characteristics of your own network connection in order to determine the appropriate value. For a machine that only communicates with other hosts across a normal Ethernet network, the answer is very simple. The value should be set to 1460 bytes, as this is the standard MSS on Ethernet networks. The IP and TCP headers together take up a standard 40 bytes (20 bytes each). With a standard Maximum Transmission Unit (MTU) of 1500 bytes on Ethernet, that would leave 1460 bytes for payload in the IP packet. In my case, I had a DSL line that used PPPoE for its transport protocol. In order to get the most out of that DSL line and avoid wasteful protocol overhead, I wanted this value to be exactly equal to the amount of payload data I could attach within a single PPPoE frame. This avoids fragmenting segments, which causes additional PPPoE frames and ATM Cells to be created, adds to the overall overhead on my DSL line and reduces my effective bandwidth. There are quite a few references out there to help you determine the appropriate setting. So, to configure for a DSL line that uses PPPoE like mine, an appropriate MSS value would be 1452 bytes. 1460 bytes is the normal MSS on Ethernet for IP traffic, as I described earlier.
With PPPoE you have to subtract an additional 8 bytes of overhead for the PPPoE header. That leaves you with an MSS of 1452 bytes. There is one other element to account for: ATM. Many DSL providers, like mine, use the ATM protocol as the underlying transport carrier for your PPPoE data. That used to be the only way it was done. ATM uses 53 byte cells of which each cell has a 5 byte header. That leaves 48 bytes for payload in each cell. If I set my MSS to 1452 bytes, that does not divide evenly across ATM’s 48 byte cell payloads. 1452/48 = 30.25, so I am left with 12 bytes of additional data to send at the end. Ultimately ATM will fill the last cell with 36 bytes of null data in that scenario. To avoid this overhead, I reduce the MSS to 1440 bytes so that it will evenly fit into the ATM cells. 30 * 48 = 1440 < 1452.
I now have AT&T U-verse which uses VDSL2+ with Packet Transfer Mode (PTM) as the underlying transport protocol. It provides a native MTU of 1500 Bytes. So this eliminates all the complexity of the above calculations and takes things back to the default of 1460 bytes. However, if you have enabled the RFC1323 option for TCP Window Scaling, the MSS should be set to 1448 to account for the 12 byte TCP Timestamp headers that OSX includes when that option is enabled.
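The MSS arithmetic above can be sketched in shell as follows. The function names are hypothetical. The sketch covers only the header sizes mentioned in this post: 40 bytes of IPv4 and TCP headers, 8 bytes for PPPoE, 12 bytes for TCP Timestamps, and 48 byte ATM cell payloads.

```shell
# Hypothetical helpers for the IPv4 MSS arithmetic described above.

# mss_for_link MTU [EXTRA_OVERHEAD]
# MSS = MTU - 40 bytes (IPv4 + TCP headers) - any extra per-packet overhead,
# such as 8 for PPPoE or 12 for TCP Timestamps.
mss_for_link() {
    echo $(( $1 - 40 - ${2:-0} ))
}

# atm_align MSS
# Round an MSS down to a whole number of 48 byte ATM cell payloads.
atm_align() {
    echo $(( $1 / 48 * 48 ))
}

# Examples:
#   mss_for_link 1500                    # 1460 (plain Ethernet)
#   mss_for_link 1500 8                  # 1452 (PPPoE)
#   atm_align "$(mss_for_link 1500 8)"   # 1440 (PPPoE over ATM)
#   mss_for_link 1500 12                 # 1448 (Ethernet with Timestamps)
```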
- The setting net.inet.tcp.v6mssdflt adjusts the default MSS for IPv6 packets. A large majority of users do not have IPv6 access yet, so this setting is not important at this point. If you are an AT&T U-verse customer and you don’t have IPv6 yet, you will very soon and probably not even know it. With AT&T, my IPv6 connectivity is not delivered natively. AT&T uses what is called a 6rd tunnel from the customer’s residential gateway (modem) to their Border Relay router to provide IPv6 access. AT&T sets the MTU on the 6rd tunnel at 1472 Bytes. So, the IPv6 MSS must be calculated starting from this point. The standard IPv6 header is 40 Bytes. The TCP header is 20 Bytes. An MTU of 1472 Bytes minus 60 Bytes of IPv6 overhead leaves us with an IPv6 MSS of 1412 Bytes. This config will depend on your IPv6 setup and whether you have native IPv6 access or are using one of the wide variety of tunnel or translational mechanisms to gain access to the IPv6 Internet.
When running 6rd tunneling for IPv6 access, the best practice, if possible, is minimally to configure the IPv6 MTU to 1480 on the router Ethernet interfaces connected to the network segments with IPv6 clients. In the case of AT&T I have to set it to 1472. The IPv6 Router Advertisement (RA) message sent from the router to the local segments will advertise this non-standard MTU to all the attached devices. They will automatically derive their default MSS as 1472 Bytes MTU – 40 Bytes IPv6 header – 20 Bytes of TCP header = 1412 Bytes. The good news is that IPv6 is pretty efficient at doing Path MTU Discovery and adjusting its own MSS on the fly. This configuration setting provides the best chance for no packet fragmentation without discovery delay.
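The same subtraction for IPv6 is shown below as a small sketch. The function name v6_mss is hypothetical, and it assumes a plain 40 byte IPv6 header with no extension headers plus a 20 byte TCP header.

```shell
# Hypothetical helper: v6_mss MTU
# IPv6 MSS = MTU - 40 bytes (IPv6 header) - 20 bytes (TCP header).
v6_mss() {
    echo $(( $1 - 40 - 20 ))
}

# Examples:
#   v6_mss 1472    # 1412 (the AT&T 6rd tunnel MTU from this post)
#   v6_mss 1480    # 1420 (6rd tunnel at the best-practice MTU)
```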
For your reference, I found the following note related to IPv6 performance on BSD which is the underlying OS for Mac OSX: Testing has shown that on end-to-end 10G paths, IPV6 appears to be about 40% slower than IPV4 on FreeBSD 7.3, and 20% slower on FreeBSD 8.2. This is a known FreeBSD issue, and will be addressed in a future release.
- net.inet.tcp.msl defines the Maximum Segment Life. This is the maximum amount of time to wait for a TCP ACK in reply to a TCP SYN-ACK or FIN-ACK, in milliseconds. If an ACK is not received in this time, the segment can be considered “lost” and the network connection is freed. This setting is primarily about DoS protection but it is also important when it comes to TCP sequence reuse or Twrap. There are two implications for this. When you are trying to close a connection, if the final ACK is lost or delayed, the socket will still close, and more quickly. However, if a client is trying to open a connection to you and their ACK is delayed longer than the value you have set, the connection will not form. RFC 793 defines the MSL as 120 seconds (120000ms), however this was written in 1981 and timing issues have changed slightly since then. Today, FreeBSD’s default is 30000ms. This is sufficient for most conditions, but for stronger DoS protection you will want to lower this. I have set mine to 15000 or 15 seconds. This will work best for speeds up to 1Gbps. See Section 1.2 on TCP Reliability starting on Page 4 of RFC1323 for a good description of the importance of TCP MSL as it relates to link bandwidth and TCP sequence reuse or Twrap. If you are using Gig links, you should set this value shorter than 17 seconds or 17000 milliseconds to prevent TCP sequence reuse issues.
Most IP stack implementations that are RFC1323 compliant now include TCP Timestamps and the Protection Against Wrapped Sequence numbers (PAWS) mechanism to counteract the Twrap problem with a lengthy MSL on a high speed link. This reduces the MSL feature to only be relevant to the length of time a system will permit a segment to live on the network without an ACK. It is probably still a good idea to keep this value fairly low with the higher bandwidth and lower latency connections of today.
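The 17 second figure for Gig links comes from the Twrap formula in Section 1.2 of RFC1323, which divides 2^31 by the link speed in bytes per second. Here is a minimal sketch of that arithmetic. The function name twrap_seconds is made up, and the result is rounded down to whole seconds.

```shell
# Hypothetical helper: twrap_seconds BITS_PER_SEC
# Twrap = 2^31 / B, where B is the link speed in bytes per second
# (RFC1323, Section 1.2). Integer division rounds the result down.
twrap_seconds() {
    echo $(( 2147483648 / ($1 / 8) ))
}

# Examples:
#   twrap_seconds 1000000000    # 17  (1Gbps)
#   twrap_seconds 100000000     # 171 (100Mbps)
```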
- net.inet.tcp.delayed_ack controls the behavior when sending TCP acknowledgements. Allowing delayed ACKs can cause pauses at the tail end of data transfers and used to be a known problem for Macs. This was due to a known poor interaction with the Nagle algorithm in the TCP stack when dealing with slow start and congestion control. I previously had recommended disabling this feature completely by setting it to “0”. I have since learned that Apple has updated the behavior of Delayed ACK to resolve this problem. Since the release of OSX 10.5 Leopard, Apple integrated support for Greg Minshall’s “Proposed Modification to Nagle’s Algorithm” into the Delayed ACK feature. This fixes the problem for 10/100/1000 Meg interfaces. I have now reverted this setting back to the default and enabled this feature in auto-detect mode by setting the value to “3”. This effectively enables the Nagle algorithm but prevents the unacknowledged runt packet problem causing an ACK deadlock which can unnecessarily pause transfers and cause significant delays.
Update: As Florian has indicated in the comments below, on interfaces above 1Gig (e.g. 10Gig), it appears that the delayed_ack feature presents performance issues once again. The large majority of users will not be impacted by this behavior on 100Meg or 1Gig interfaces. Hopefully I will be able to find some further reference information on this to post here. So, for now, if you are using a 10Gig interface you will want to disable this feature.
For your reference, following are the available options:
- delayed_ack=0 responds after every packet (OFF)
- delayed_ack=1 always employs delayed ack, 6 packets can get 1 ack
- delayed_ack=2 immediate ack after 2nd packet, 2 packets per ack (Compatibility Mode)
- delayed_ack=3 should auto detect when to employ delayed ack, 4 packets per ack. (DEFAULT recommended)
- net.inet.tcp.slowstart_flightsize sets the number of outstanding packets permitted with non-local systems during the slowstart phase of TCP ramp up. In order to more quickly overcome TCP slowstart, I have bumped this up to a value of 20. This allows my system to use up to 10% of my bandwidth during TCP ramp up. I calculated this by figuring my Bandwidth-Delay Product and taking 10% of that value divided by the max MSS of 1448 bytes to get rough packet count. So, taking the line rate at 45Mbps or 45 x 10^6 bits per second x 50 milliseconds or 0.05 seconds / 8 bits per byte / 1448 bytes per packet x 10%, I came up with roughly 20 packets.
- net.inet.tcp.local_slowstart_flightsize is the same as above but only applies to connections on the local network. Typically you can be liberal and set this to be less restrictive than the above setting. However, locally you usually have higher bandwidth @ 100Mbps or 1Gbps and lower latency at ~1 msec or less. If I followed the same formula as above, I’d come up with less than 1 packet on a 100Mbps connection with ~1 msec of latency. In my case, I have a 1Gig connection. At 10% that works out to 8.6 packets so I rounded it up to 9 packets.
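The flight size arithmetic for both settings can be sketched as a single shell function. The name slowstart_packets is hypothetical, and rounding up to the next whole packet is an assumption that mirrors the rounding used above.

```shell
# Hypothetical helper:
#   slowstart_packets BITS_PER_SEC RTT_MS MSS PERCENT
# Packets = BDP in bytes * PERCENT / 100 / MSS, rounded up to a whole packet.
slowstart_packets() (
    bdp=$(( $1 * $2 / 1000 / 8 ))
    echo $(( (bdp * $4 + 100 * $3 - 1) / (100 * $3) ))
)

# Examples:
#   slowstart_packets 45000000 50 1448 10     # 20 (45Mbps, 50ms)
#   slowstart_packets 1000000000 1 1448 10    # 9  (1Gbps, 1ms)
```

Because of the rounding, the 100Mbps at 1 msec case comes out as 1 packet, not a fraction of one.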
- net.inet.tcp.blackhole defines what happens when a TCP packet is received on a closed port. When set to ‘1’, SYN packets arriving on a closed port will be dropped without a RST packet being sent back. When set to ‘2’, all packets arriving on a closed port are dropped without an RST being sent back. This saves both CPU time because packets don’t need to be processed as much, and outbound bandwidth as packets are not sent out.
- net.inet.udp.blackhole is similar to net.inet.tcp.blackhole in its function. As the UDP protocol does not have states like TCP, there is only a need for one choice when it comes to dropping UDP packets. When net.inet.udp.blackhole is set to ‘1’, all UDP packets arriving on a closed port will be dropped.
- The name ‘net.inet.icmp.icmplim’ is somewhat misleading. This sysctl controls the maximum number of ICMP “Unreachable” and also TCP RST packets that will be sent back every second. It helps curb the effects of attacks which generate a lot of reply packets. I have set mine to a value of 50.
- Mac Geekery’s Network Tuning Guide – (offline) Archive link
- TCP Tuning Guide for FreeBSD
- TCP Performance problems caused by interaction between Nagle’s Algorithm and Delayed ACK
- Wikipedia.org – Bandwidth-Delay Product
- IETF RFC1323 – TCP Extensions for High Performance
- Adjusting network buffer memory on OSX
- Explanation of socket buffer memory allocation
- OSX Boot-time kernel arguments
- Detailed explanation of the BSD Initialization process of the Mac OSX kernel
- Energy Sciences Network (U.S. Department of Energy)
- The Mac Observer podcast – at ~37 minutes
- Jeremy’s Toolbox
- SuperUser Q&A site
- D-Mac’s Stuff
- Jason Hu