How to tune the TCP/IP stack for high volume of web requests

Hi All

In environments with a very high number of web requests per second, or with a high number of web services/references integrations, you might find that the application's performance is lower than you would expect from the system. Even worse, the applications or web services may stop responding completely or generate timeout errors, even though your system's resources (CPU, RAM or network bandwidth) don't seem to be exhausted at all. Although the causes for such symptoms can vary, there's one scenario that can completely lock up systems handling a very large number of web requests per second without any hint of what's going on: TCP/IP port exhaustion.

Web requests consume TCP/IP connections, and each connection is unique and uses resources on your system. I'm talking about port resources on the TCP/IP stack: each connection uses a unique port to handle its requests, and these ports are finite. There are only 65535 ports available on the system.

Some ports are reserved and used by the system itself. Others are bound permanently by applications (usually until the application terminates, like port 80 for the Application Server, port 3389 for Remote Desktop, or ports 12000-12004 for OutSystems Services). Finally, there's a set of ports called "ephemeral ports" that applications use as temporary ports.

In the context of a web server, these temporary ports are used to handle web requests or to send web reference requests. Once a request is done, the port is released back to the operating system, so it can be recycled and used by other applications or for other requests.

However, ephemeral ports take some time to be completely released to the system. In a highly loaded environment, with hundreds of fast requests per second, you may reach a state where all ephemeral ports are either in use or "waiting" to be released to the system. When this happens, the application server will not be able to use them for new connections.

But there are ways to tune the TCP/IP stack to reduce the impact of this problem, allowing the system to take advantage of all the resources at its disposal. We can tune several TCP/IP stack parameters, but in this context there are 2 that really make the difference:
  • The time a port stays in the "waiting" state, from the moment the application releases it until the system actually frees it for reuse. It's called the TIME WAIT
  • The maximum number of ephemeral ports that the system can use, out of the total pool of 65535 available ports. Let's just call it the RANGE OF EPHEMERAL PORTS
As you would expect, these parameters vary from one operating system to another, and so do their default values. So let me detail the default values on the supported operating systems.

                           Windows 2003   Windows 2008 R2   RedHat 5
TIME WAIT (seconds)        240            240               60
RANGE OF EPHEMERAL PORTS   1024-5000      49152-65535       32768-61000
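
As a rough way to read these defaults: if every closed connection keeps its ephemeral port for the whole TIME WAIT period, the sustained rate of new connections is bounded by the pool size divided by the TIME WAIT. The sketch below applies that simplified model to the values in the table. The function name and the model itself are assumptions made for illustration; real TCP/IP stacks account for ports per remote endpoint, so actual limits can differ.

```python
def max_new_connections_per_second(first_port, last_port, time_wait_seconds):
    """Upper bound on sustained new connections per second (simplified model)."""
    pool_size = last_port - first_port + 1  # both ends of the range are counted
    return pool_size / time_wait_seconds


# Default values copied from the table above.
defaults = {
    "Windows 2003": (1024, 5000, 240),
    "Windows 2008 R2": (49152, 65535, 240),
    "RedHat 5": (32768, 61000, 60),
}

for name, (first, last, wait) in defaults.items():
    rate = max_new_connections_per_second(first, last, wait)
    print(f"{name}: {last - first + 1} ports, about {rate:.1f} new connections/s")
```

Under this model, shortening the TIME WAIT and widening the port range both raise the bound, which is why these are the two parameters worth tuning.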

So how can we improve these parameters for better performance?

We can reduce the TIME WAIT to its minimum, allowing the ephemeral ports to be recycled faster, and increase the maximum number of ephemeral ports, further improving their availability on the system.

How can we do that? That's the easy part. Just follow the instructions below for the corresponding operating system.

Windows 2003

Reduce the TIME_WAIT by setting the TcpTimedWaitDelay TCP/IP parameter to 30 seconds, as a DWORD value, under the Windows registry key HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.

Increase the range of ephemeral ports by setting the MaxUserPort TCP/IP parameter to a higher value (like 32768), as a DWORD value, under the Windows registry key HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. This will set the port range from 1024 to 32768.
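
For those who prefer importing a file over editing the registry by hand, here is a minimal sketch that renders a .reg file with both values. The function name is hypothetical, and the accepted ranges in the checks (30 to 300 seconds, and 5000 to 65534 for MaxUserPort) are the documented ranges as I understand them, so treat them as assumptions and verify them for your system.

```python
KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def render_reg_file(time_wait_seconds=30, max_user_port=32768):
    """Return the text of a .reg file that sets both DWORD values."""
    # Assumed valid ranges; check the Microsoft documentation for your version.
    if not 30 <= time_wait_seconds <= 300:
        raise ValueError("TcpTimedWaitDelay is expected to be between 30 and 300")
    if not 5000 <= max_user_port <= 65534:
        raise ValueError("MaxUserPort is expected to be between 5000 and 65534")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        # DWORD values are written as 8 hexadecimal digits.
        f'"TcpTimedWaitDelay"=dword:{time_wait_seconds:08x}',
        f'"MaxUserPort"=dword:{max_user_port:08x}',
        "",
    ]
    return "\r\n".join(lines)


print(render_reg_file())
```

The output can be saved to a file with the .reg extension and reviewed before it is imported on the server.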

Windows 2008 R2

Reduce the TIME_WAIT by setting the TcpTimedWaitDelay TCP/IP parameter to 30 seconds, as a DWORD value, under the Windows registry key HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters.

Increase the range of ephemeral ports by setting the dynamicportrange to a higher value through the command netsh int ipv4 set dynamicportrange tcp start=32767 num=32768. The range starts at the start port and contains num ports, so this will set the port range from 32767 to 65534.
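
Because the netsh command takes a start port and a number of ports rather than an end port, it is easy to pick a combination that the system rejects. Here is a small sketch of the arithmetic: the range runs from start to start + num - 1, and by that arithmetic the command above covers ports 32767 through 65534. The helper name is hypothetical, and the limits it checks (start of at least 1025, at least 255 ports, last port no higher than 65535) are the documented limits as I understand them.

```python
def dynamic_port_range(start, num):
    """Return (first_port, last_port) for 'start=<start> num=<num>'."""
    last = start + num - 1
    # Assumed limits of the netsh command; verify them on your system.
    if start < 1025:
        raise ValueError("start must be at least 1025")
    if num < 255:
        raise ValueError("num must be at least 255")
    if last > 65535:
        raise ValueError(f"range would end at {last}, beyond port 65535")
    return start, last


print(dynamic_port_range(32767, 32768))  # (32767, 65534)
```

A value such as num=65535 with start=32767 fails the last check, which matches the kind of "parameter incorrect" rejection netsh gives for an out-of-range request.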

RedHat Linux 5

Reduce the TIME_WAIT by setting the tcp_fin_timeout kernel value in /proc/sys/net/ipv4/tcp_fin_timeout. Use the command echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout to set it to 30 seconds.

Increase the range of ephemeral ports by setting the ip_local_port_range kernel value in /proc/sys/net/ipv4/ip_local_port_range. Use the command echo "32768 65535" > /proc/sys/net/ipv4/ip_local_port_range to set the port range from 32768 to 65535.

These commands don't persist the kernel parameters, which are reset to their default values on system reboot, so make sure to place the commands in a system startup script such as /etc/rc.local.
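
To check what a server is currently using, or to prepare persistent settings, something like the sketch below can help. It parses the content of ip_local_port_range and renders the equivalent sysctl settings. The function names are hypothetical, and keeping the values in /etc/sysctl.conf is an alternative to the startup script mentioned above, not something required by this post.

```python
def parse_port_range(text):
    """Parse the content of /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def sysctl_lines(fin_timeout=30, port_range=(32768, 65535)):
    """Render the two settings in sysctl.conf syntax."""
    low, high = port_range
    return [
        f"net.ipv4.tcp_fin_timeout = {fin_timeout}",
        f"net.ipv4.ip_local_port_range = {low} {high}",
    ]


# The kernel separates the two numbers with whitespace (a tab on most systems).
print(parse_port_range("32768\t61000\n"))
print("\n".join(sysctl_lines()))
```

On a live system, the input for parse_port_range would come from reading /proc/sys/net/ipv4/ip_local_port_range.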

More related information about the effects of low availability on the ephemeral ports is available at:

Hope this information is helpful in fine tuning your environments for high availability and performance.

Cheers

Miguel João
Hi Miguel,

Is there any way to measure these parameters, so we can know if we are indeed being affected by this problem and also better choose the appropriate values?
I was thinking maybe performance counters, but which ones would be useful?

Thank you,

Paulo Cunha
Hi Paulo

The Windows operating system doesn't have specific performance counters for ephemeral ports. However, you can use other performance counters to identify the overall number of TCP ports open on the system.

Using the performance counter TCPv4\Connections Established, you can identify the number of existing connections on the server. If it's near the default limits, then you might have problems establishing new connections, and the usual symptoms are runtime errors with exceptions like:

An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full 
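
Another way to get a picture of port usage is to count connections by state from the output of netstat -an, since ports stuck in TIME_WAIT don't show up as established connections. The following is only a sketch: the function name is made up, the sample lines are invented for illustration, and it assumes the state is the last column of each TCP row (which holds for plain netstat -an, but not when extra columns such as the PID are requested).

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections by state from 'netstat -an' style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows start with a protocol name such as "TCP", "tcp" or "tcp6";
        # header and UDP rows are skipped.
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1].upper()] += 1
    return states


# Invented sample output, for illustration only.
sample = """\
  Proto  Local Address    Foreign Address  State
  TCP    10.0.0.5:80      10.0.0.9:51234   ESTABLISHED
  TCP    10.0.0.5:49180   10.0.0.7:1433    TIME_WAIT
  TCP    10.0.0.5:49181   10.0.0.7:1433    TIME_WAIT
  TCP    0.0.0.0:80       0.0.0.0:0        LISTENING
"""

for state, count in sorted(count_tcp_states(sample).items()):
    print(state, count)
```

A TIME_WAIT count that approaches the size of the ephemeral port range is a strong hint that the server is running out of ports.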

Hope this information is helpful.

Cheers

Miguel Simões João
Hi Miguel,

First of all, thank you for writing such a clear and good post. After reading it, it is very clear how to tune the TCP/IP stack to eliminate the dreaded error "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full".

Unfortunately, although the tuning increased the requests per second in my performance test, which targets a huge load of 10,000 users, the error message still crops up every time I near the completion of my test, even after tuning the TCP/IP stack as described in your post. To be exact, I get the error whenever I reach the mark of 8500 users in my load test.

Before I end, a brief description of the environment I am using: I am using Visual Studio Team Foundation 2010 for my test, so I have a rig created, and my controller and agents reside on VMs.

I am just puzzled as to what more I need to configure to eliminate this error.

Cheers
Abdul Latif Saquib
Hello Abdul

The error "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full" is not necessarily caused only by TCP ephemeral port exhaustion. There are also memory concerns to take into consideration; for instance, in 32-bit environments that use the /3GB boot flag on the operating system, this can be a frequent error.

Do you care to share a little bit more about the architecture of your environment? What operating systems are being used? Are they 32 or 64 bits? Which have OutSystems Deployment Controller running and which are only Front-Ends with the OutSystems Deployment service?

When you're performing the load test and reach the 8500 users, how many TCP connections do you see in the TCPv4\Connections Established performance counter?

Cheers

Miguel João
Hi Miguel ,

Thanks for your timely reply. Here are the details of my setup.

Architecture details:
VSTS 2010 controller - 1 (physical machine)
VSTS 2010 agents - 10 (virtual machines)
All machines are on Windows 2008 R2 with the latest SP

The environment is a VMware setup:
Web servers - 8
Application servers - 2
SQL Database servers

I see these errors in the VSTS results and also on the VSTS controller and agents.
I tried tuning the following:
TcpTimedWaitDelay - 30
MaxUserPort - 65537
I have now disabled TCP Auto-Tuning. I thought the TCP window size was what I needed to alter.

Yet the test starts to throw the dreaded WSAENOBUFS (10055) error just after the 8500 user load. There is memory in abundance and I see no pressure on it.

Thanks and Regards
M.A Latif Saquib



 
Hello Abdul

To have more context about the applications you're testing, are they OutSystems applications, or just ASP.NET applications?

Like I said, the error is not only due to a lack of ephemeral ports. It can also happen if your memory resources are stretched too thin. I would recommend monitoring the following performance counters on the servers that are generating the error:

- TCPv4\Connections Established
- Memory\Available MBytes
- Memory\Free System Page Table Entries
- Memory\Pool Nonpaged Bytes
- Memory\Pool Paged Bytes
- Paging File\% Usage

If there's memory strangulation during the test, you'll see Available MBytes decreasing, Paging File usage increasing, and the Pool Nonpaged and Pool Paged Bytes also increasing. If they reach their limits, then you've got memory strangulation on the process.

Also, make sure the Connections Established are way below the 65535 limit.

Do you care to share some of those metrics for analysis?

Cheers

Miguel
Hi Miguel,

First of all, let me thank you for replying and for your support, though a bit late; I take it you were away celebrating.

I do not have OutSystems applications, just ASP.NET. After reading your good post and many, many articles on the web, and after monitoring my memory and other metrics, I could only conclude that the memory I had was abundant and there was no way I was being strangulated. One thing I could conclude was that the load test was somehow opening ports beyond the limit you rightly mentioned, 65535, and that I had to limit them.

Alhamdullilah, two days back I was able to curb the problem. The resolution was simple in the end: with all the understanding I now have, I must confess that I should have used the Connection Pool option instead of the Connection Per User option I was using in my load test. In VSTS, user load is not quite proportional to hits; it does not tell you exactly how many hits are being generated at a given load, so you cannot determine the concurrency.

Anyway, the solution: I limited the connections by pooling a limited number of connections. This avoided opening an abundance of ports and also kept the TCP stack's backlog connection queue from getting flooded.
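
To illustrate why pooling helps, here is a generic sketch of a bounded connection pool. It is not the Visual Studio load test setting itself; the class names and the fake connection are hypothetical stand-ins. The point is that many requests share a fixed number of connections, so the number of ports opened stays at the pool size.

```python
import queue


class ConnectionPool:
    """Hand out at most `size` connections and reuse them between requests."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    def acquire(self, timeout=None):
        # Blocks while every connection is busy instead of opening a new one.
        return self._idle.get(timeout=timeout)

    def release(self, connection):
        self._idle.put(connection)


class FakeConnection:
    """Hypothetical stand-in for a real client connection."""

    opened = 0  # how many connections (and therefore ports) were ever opened

    def __init__(self):
        FakeConnection.opened += 1

    def request(self, path):
        return f"GET {path}"


pool = ConnectionPool(FakeConnection, size=4)
for i in range(1000):
    conn = pool.acquire()
    try:
        conn.request(f"/page/{i}")
    finally:
        pool.release(conn)

print(FakeConnection.opened)  # 4
```

With one connection per user, the same 1000 requests could open up to 1000 connections, each holding a port through its TIME_WAIT period.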

Well, to be more realistic I may still have to use the Connection Per User option; Inshallah, I feel better off now than before, thanks to your support and the progress.

Thanks again Miguel
Best Regards
M.A Latif Saquib
       
I am on Windows Server 2008 R2.

I got the "parameter incorrect" error on
netsh int ipv4 set dynamicportrange tcp start=32767 num=65535

But the following works
netsh int ipv4 set dynamicportrange tcp start=32767 num=32767
I guess the num parameter in the command "netsh int ipv4 set dynamicportrange tcp start=32767 num=32767"
creates a port range from 32767 to 32767 + num. If start + num does not exceed 65534, we do not get the "parameter incorrect" error.
Hello Terry,

Great catch. After all this time I hadn't noticed the typo in the command line for Windows. I've fixed it in the original post.

Thanks Pradeep for the insight and explanation.

Cheers

Miguel João
Hi there
I am having a problem: my NVR is on a static IP (intranet) and uses the internet through an ISP broadband router (DHCP). The issue is that my NVR stops responding to ping automatically unless I change the Ethernet port. I want to know how to prevent the router from breaking the ping. I fixed the whole DHCP range to prevent conflicts, but with no result. Kindly help me with this.
Hello Jahangir

Unfortunately, that seems to be a network configuration problem with the router, or some firewall. Can't help you there.

Have you tried contacting the router manufacturer?

Cheers
Hi
Thanks for your reply. I cannot contact the manufacturer; it is an OEM device from a national broadband company. I disabled DHCP and that made quite a bit of difference. I think there is still some problem: I need to find out how to prevent the router from breaking the ping of a specified IP, if you can guide me.