2009-05-16

Detection of Half-Open (Dropped) Connections

(This post is part of the TCP/IP .NET Sockets FAQ)

There is a three-way handshake to open a TCP/IP connection, and a four-way handshake to close it. However, once the connection has been established, if neither side sends any data, then no packets are sent over the connection. TCP is an "idle" protocol, happy to assume that the connection is active until proven otherwise.

TCP was designed this way for resiliency and efficiency. This design enables a graceful recovery from unplugged network cables and router crashes. For example, a client may connect to a server, an intermediate router may be rebooted, and after the router comes back up, the original connection still exists (unless data was sent across the connection while the router was down). This design is also efficient, since no "polling" packets are sent across the network just to check whether the connection is still OK, which avoids unnecessary network traffic.

TCP does have acknowledgments for data, so when one side sends data to the other side, it will receive an acknowledgment if the connection is still active (or an error if it is not). Thus, broken connections can be detected by sending out data. It is important to note that the act of receiving data is completely passive in TCP; a socket that only reads cannot detect a dropped connection.

This leads to a scenario known as a "half-open connection". At any given point in most protocols, one side is expected to send a message and the other side is expecting to receive it. Consider what happens if an intermediate router is suddenly rebooted at that point: the receiving side will continue waiting for the message to arrive; the sending side will send its data, and receive an error indicating the connection was lost. Since broken connections can only be detected by sending data, the receiving side will wait forever. This scenario is called a "half-open connection" because one side realizes the connection was lost but the other side believes it is still active.

Terminology alert: "half-open" is completely different from "half-closed". A half-closed connection is one where one side has performed a Shutdown operation on its socket, shutting down only the sending (outgoing) stream. See Socket Operations for more details on the Shutdown operation.
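To make the half-closed case concrete, here is a minimal sketch. It is written in Python rather than .NET purely to keep it short, and it uses socket.socketpair() as a stand-in for a real client/server connection (an assumption made so the example is self-contained). The function name half_close_demo is hypothetical.

```python
import socket


def half_close_demo():
    """Show a half-closed connection: one direction shut down, the other usable."""
    # socketpair() returns two already-connected stream sockets; it stands in
    # for a real TCP client/server pair so the sketch needs no network setup.
    a, b = socket.socketpair()
    try:
        a.sendall(b"request")
        a.shutdown(socket.SHUT_WR)   # a will send nothing more (half-close)

        received = b.recv(1024)      # data sent before the shutdown still arrives
        eof = b.recv(1024)           # an empty read means a's sending side is closed

        b.sendall(b"response")       # the b -> a direction is still open
        reply = a.recv(1024)
        return received, eof, reply
    finally:
        a.close()
        b.close()
```

Note that both sides know exactly what state the connection is in here; that is what distinguishes a deliberate half-close from the half-open situation, where one side is simply unaware.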

Causes of Half-Open Connections

Half-open connections are on that annoying list of problems that one seldom sees in a test environment but that commonly happen in the real world. This is because if the socket is shut down with the normal four-way handshake (or even if it is abruptly closed), the half-open problem will not occur. Some of the common causes of a half-open connection are described below:

  • Process crash. If a process shuts down normally, it usually sends out a "FIN" packet, which informs the other side that the connection has been lost. However, if a process crashes or is terminated (e.g., from Task Manager), this is not guaranteed. It is possible that the OS will send out a "FIN" packet on behalf of a crashed process; however, this is up to the OS.
  • Computer crash. If the entire computer (including the OS) crashes or loses power, then there is certainly no notification to the other side that the connection has been lost.
  • Router crash/reboot. Any of the routers along the route from one side to the other may also crash or be rebooted; this causes a loss of connection if data is being sent at that time. If no data is being sent at that exact time, then the connection is not lost.
  • Network cable unplugged. Any network cables unplugged along the route from one side to the other will cause a loss of connection without any notification. This is similar to the router case; if there is no data being transferred, then the connection is not actually lost. However, computers usually will detect if their specific network cable is unplugged and may notify their local sockets that the network was lost (the remote side will not be notified).
  • Wireless devices (including laptops) moving out of range. A wireless device that moves out of its access point range will lose its connection. This is an often-overlooked but increasingly common situation.

In all of the situations above, it is possible that one side may be aware of the loss of connection, while the other side is not.

Is Explicit Detection Necessary?

There are some situations in which detection is not necessary. A "poll" system (as opposed to a "subscription/event" system) already has a timer built in (the poll timer), and sends data across the connection regularly. So the polling side does not need to explicitly check for connection loss.

The necessity of detection must be considered separately for each side of the communication. For example, if the protocol is based on a polling scheme, then the side doing the polling does not need explicit keepalive handling, but the side responding to the polling likely does need explicit keepalive handling.

True Story: I once had to write software to control a serial device that operated through a "bridge" device that exposed the serial port over TCP/IP. The company that developed the bridge implemented a simple protocol: they listened for a single TCP/IP connection (from anywhere), and - once the connection was established - sent any data received from the TCP/IP connection to the serial port, and any data received from the serial port to the TCP/IP connection. Of course, they only allowed one TCP/IP connection (otherwise, there could be contention over the serial port), so other connections were refused as long as there was an established connection.

The problem? No keepalives. If the bridge ever ended up in a half-open situation, it would never recover; any connection requests would be rejected because the bridge would believe the original connection was still active. Usually, the bridge was deployed to a stationary device on a physical network; presumably, if the device ever stopped working, someone would walk over and perform a power cycle. However, we were deploying the bridge onto mobile devices on a wireless network, and it was normal for our devices to pass out of and back into access point coverage. Furthermore, this was part of an automated system, and people weren't near the devices to perform a power cycle. Of course, the bridge failed during our prototyping; when we brought the root cause to the other company's attention, they were unable to implement a keepalive (the embedded TCP/IP stack didn't support it), so they worked with us in developing a method of remotely resetting the bridge.

It's important to note that we did have keepalive testing on our side of the connection (via a timer), but this was insufficient. It is necessary to have keepalive testing on both sides of the connection.

This bridge was in full production, and had been for some time. The company that made this error was a billion-dollar global corporation centered around networking products. The company I worked for had four programmers at the time. This just goes to show that even the big guys can make mistakes.

Wrong Methods to Detect Dropped Connections

There are a couple of wrong methods to detect dropped connections. Beginning socket programmers often come up with these incorrect solutions to the half-open problem. They are listed here only for reference, along with a brief description of why they are wrong.

  • A second socket connection. A new socket connection cannot determine the validity of an existing connection in all cases. In particular, if the remote side has crashed and rebooted, then a second connection attempt will succeed even though the original connection is in a half-open state.
  • Ping. Sending a ping (ICMP) to the remote side has the same problem: it may succeed even when the connection is unusable. Furthermore, ICMP traffic is often treated differently than TCP traffic by routers.

Correct Methods to Detect Dropped Connections

There are several correct solutions to the half-open problem. Each one has its pros and cons, depending on the problem domain. This list is in order from best solution to worst solution (IMO):

  1. Add a keepalive message to the application protocol framing (an empty message). Length-prefixed and delimited systems may send empty messages (e.g., a length prefix of "0 bytes" or a single "end delimiter").
    Advantages. The higher-level protocol (the actual messages) is not affected.
    Disadvantages. This requires a change to the software on both sides of the connection, so it may not be an option if the application protocol is already specified and immutable.
  2. Add a keepalive message to the actual application protocol (a "null" message). This adds a new message to the application protocol: a "null" message that should just be ignored.
    Advantages. This may be used if the application protocol uses a non-uniform message framing system. In this case, the first solution could not be used.
    Disadvantages. (Same as the first solution) This requires a change to the software on both sides of the connection, so it may not be an option if the application protocol is already specified and immutable.
  3. Explicit timer assuming the worst. Have a timer and assume that the connection has been dropped when the timer expires (of course, the timer is reset each time data is transferred). This is the way HTTP servers work, if they support persistent connections.
    Advantages. Does not require changes to the application protocol; in situations where the code on the remote side cannot be changed, the first two solutions cannot be used. Furthermore, this solution causes less network traffic; it is the only solution that does not involve sending out keepalive (i.e., "are you still there?") packets.
    Disadvantages. Depending on the protocol, this may cause a high number of valid connections to be dropped.
  4. Manipulate the TCP/IP keepalive packet settings. This is a highly controversial solution that has complex arguments for both pros and cons. It is discussed in depth in Stevens' book, chapter 23. Essentially, this instructs the TCP/IP stack to send keepalive packets periodically on the application's behalf. There are two ways that this can be done:
    1. Set SocketOptionName.KeepAlive. The MSDN documentation isn't clear that this uses a 2-hour timeout, which is too long for most applications. This can be changed (system-wide) through a registry key, but changing this system-wide (i.e., for all other applications) is greatly frowned upon. This is the old-fashioned way to enable keepalive packets.
    2. Set per-connection keepalives. Keepalive parameters can be set per-connection only on Windows 2000 and newer, not the old 9x line. This has to be done by issuing I/O control codes to the socket: pass IOControlCode.KeepAliveValues along with a structure to Socket.IOControl; the necessary structure is not covered by the .NET documentation but is described in the unmanaged documentation for WSAIoctl (SIO_KEEPALIVE_VALS).
    Advantages. Once the code to set the keepalive parameters is working, there is nothing else that the application needs to change. The other solutions all have timer events that the application must respond to; this one is "set and forget".
    Disadvantages. RFC 1122, section 4.2.3.6 indicates that acknowledgements for TCP keepalives without data may not be transmitted reliably by routers; this may cause valid connections to be dropped. Furthermore, TCP/IP stacks are not required to support keepalives at all (and many embedded stacks do not), so this solution may not translate to other platforms.
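For the fourth solution, a per-connection keepalive setup might look roughly like the sketch below. It is in Python rather than .NET, the function name enable_keepalive and the timer values are hypothetical, and the per-socket options it uses are platform-specific, so each one is guarded. On Windows the ioctl call corresponds to SIO_KEEPALIVE_VALS; on Linux the options are TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT.

```python
import socket
import sys


def enable_keepalive(sock, idle_s=30, interval_s=5, probes=4):
    """Turn on TCP keepalive for one socket and shorten its timers where possible."""
    # This alone enables keepalives with the system-wide default timing.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    if sys.platform == "win32" and hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_s * 1000, interval_s * 1000))
        return

    # Linux (and some other Unix systems) expose per-socket TCP options.
    # Each is guarded because not every platform defines all of them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```

With these example values on Linux, a dead peer would be noticed after roughly idle_s + interval_s * probes seconds of silence; the right numbers depend entirely on the application.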

Each side of the application protocol may employ different keepalive solutions, and even different keepalive solutions at different states in the protocol. For example, the client side of a request/response style protocol may choose to send "null" requests when there is not a request pending, and switch to a timeout solution while waiting for a response.
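The first and third solutions can be sketched together as follows. This is Python rather than .NET, and it assumes a 4-byte big-endian length prefix as the framing, which is only one possible choice. All names (encode_frame, FrameDecoder, IdleTimer) are hypothetical. A zero-length frame acts as the keepalive and is dropped by the decoder, so the higher-level messages are unaffected.

```python
import struct
import time

HEADER = struct.Struct("!I")   # assumed framing: 4-byte big-endian length prefix
KEEPALIVE = HEADER.pack(0)     # a zero-length frame serves as the keepalive


def encode_frame(payload: bytes) -> bytes:
    """Prefix a message with its length."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles frames from a byte stream and silently drops keepalives."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, data: bytes) -> list:
        """Add received bytes; return the complete non-empty messages so far."""
        self._buffer += data
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break                      # wait for the rest of this frame
            payload = bytes(self._buffer[HEADER.size:end])
            del self._buffer[:end]
            if length:                     # length 0 is a keepalive: ignore it
                messages.append(payload)
        return messages


class IdleTimer:
    """Tracks how long it has been since the last reset."""

    def __init__(self, limit_s, clock=time.monotonic):
        self._limit = limit_s
        self._clock = clock
        self._last = clock()

    def reset(self):
        self._last = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last >= self._limit
```

One way to use this: reset a send-side IdleTimer whenever anything is written and send KEEPALIVE when it expires; if the connection is gone, that send eventually fails and the socket reports an error. A side that cannot rely on the remote end sending keepalives can instead reset a receive-side IdleTimer on every read and close the connection when it expires.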

However, when designing a new protocol, it is best to employ one of the solutions consistently.


32 comments:

  1. Excellent information. Thanks a lot, I was just going to implement solution 3 without knowing if this was a good solution.

    Thanks again,

  2. Thanks for posting this, very clear and helpful!

  3. What an excellent post, I was also going to implement solution 3, but wanted to do a little extra research first.

  4. Thank you so much. You just helped me realize why my connections were unstable. Excellent post.

  5. If you would like to know how to implement 4.2 see the solution at the following link: http://stackoverflow.com/questions/1993635/c-alternative-to-networkstream-read-that-indicates-remote-host-has-closed-the-c

  6. Hi All,

    I'm not a .NET programmer, but I need a way of detecting dropped connections in both Java and C#. What I found on the internet is that by sending urgent data (as a separate stream) on the existing connection, I can tell whether the connection is still alive before sending a response back (http://en.wikipedia.org/wiki/Transmission_Control_Protocol#Out_of_band_data). That is a very good solution and I have already implemented it in Java (sendUrgentData), but so far I have been unable to find such a solution for C#. Can you try to solve this in C# using hidden data (out_of_band_data)? Are there methods (or approaches) for sending hidden data in C#?

    Thank you for your responses on this matter. I hope I have helped you find a good solution for detecting terminated connections.

    BR
    pecko

  7. @pecko: OOB data can be sent by passing SocketFlags.OutOfBand to one of the Send methods.

  8. So can I ask what's involved in implementing solution number 1?

    Does the sender just send out a 0 length keepalive message if say, no data has been sent for 45s. And then the reader knows that if it hasn't received any data for, say 60s then the link must be dead and it should close it?

    Or am I off the mark.

  9. @Oilspill: I usually have *both* sides send a 0-length keepalive message if no other data has been sent for a given period of time (or if you want it simpler, just always send out a 0-length keepalive message periodically).

    There's no need for a timeout to detect the lost connection; if the connection has been lost, then the keepalive message will fail (because it was never ACKed), and the socket will go into an error state.

  10. Hello,
    I have an issue with my client-server SMS application (multiple clients). Initially all clients connect properly, but after some time some clients raise a socket exception and lose the connection. If I restart the client after such an exception, it shows as connected, but when the client sends a frame it does not get any ack from the server. I think this is a half-open connection issue, but I am unable to find a solution. Any help will be appreciated.

    Replies
    1. Let me know if any one want more details or code.

    2. It sounds like the clients are detecting it but the server is not. Both sides need to detect the half-open situation. Is the server periodically sending keepalives or data to all clients?

  11. No, the server is not sending any data to the clients, but the keepalive property is set (we are using the C#.Net TCP/IP Lib), and I think this property does the work of checking whether the client is alive.
    After more debugging I found an issue in thread synchronization (Communicator, Transmitter, Receiver). I have resolved it; it is working and I am now testing it.
    If this does not work, then I think the server needs to send empty frames periodically to check whether the clients are running.
    Please correct me if I am wrong.
    Thanks for your response. I will let you know the result.

    Replies
    1. The keepalive property will check the connection periodically. Unfortunately, it only checks every 2 hours, as mentioned in my blog post above.

  12. Hi Stephen,
    I want to add the TCP KeepAlive property to my TCP client.
    I have this in Form_Load (I am using a Windows Form):
    tcpSocket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, 1)
    Do I need anything else?
    Can you give me a code snippet implementing TCP KeepAlive (C# or VB.NET)?

    Replies
    1. As I pointed out in my blog, the SocketOptionName.KeepAlive sets a 2-hour timeout on the socket. If that's acceptable for your app, then that's all you need.

      Most apps prefer to detect dropped connections sooner than 2 hours. In that case, you need to pass IOControlCode.KeepAliveValues to Socket.IOControl with an unmanaged SIO_KEEPALIVE_VALS structure.

      I do not have a code sample, sorry!

    2. Thanks Stephen,
      I have found on the internet a procedure for setting KeepAlive values, but I am wondering how to use it. My application is a TCP client that connects to a server and receives messages; from time to time I want to know if the server is still alive. For the moment I have implemented this in the MainForm_Load sub (VB.NET):

      tcpSocket = New Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
      tcpSocket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, True)
      SetKeepAlive(tcpSocket, keepalivetime, keepaliveinterval)

      Also, I am wondering how my loop that reads data is notified that the server is off when TCP KeepAlive detects it.

    3. When keep-alives detect a dropped connection, any current read or write will finish with an exception. Also, the socket will be placed into an error state, so any future reads or writes will also throw exceptions.

      I think. :) I've never actually used keep-alives.

  13. Thanks Stephen,
    I have also used a heartbeat function (IsTcpConnected) for checking the connection state.
    If I use TCP KeepAlive, should I also use this function, or is it better to omit it?
    Thanks!

    Replies
    1. You only *need* to check the connection using one method. Although you *could* use both if you wanted to - it won't mess anything up.

  14. What was the solution to the "True Story"? I think I have someone fighting the same or a similar brain-dead device as the "bridge" device you described.

    Replies
    1. The vendor's firmware would only accept one connection at a time. However, instead of accepting new connections and disposing of the existing connection, they just refused new connections. They did not want to change this behavior, and their embedded IP stack did not support keepalives (back then, most of them did not).

      At our insistence, the vendor implemented a second listener. So, when we detected a half-open situation (where we did not have a connection but the bridge refused new connections), we could connect to a second special port number. I don't remember if we had to send a special message over that second connection or not. The vendor's "special" server would then kill their regular server, and we could (after a short delay) connect to the bridge normally again.

      This was not the best solution (it's still technically possible for a half-open situation to exist for *both* the bridge server and "special" server), and the workaround was annoying, but in testing it worked well enough to ship to the field.

  15. Stephen,
    To your knowledge, is solution #4 (keepalive packets) supported in modern Linux systems? I'm running on Linux kernel 3.1.10 and I'm trying to modify keepalive settings both through setsockopt() in C++ code and via the files in the /proc/sys/net/ipv4 directory, and none of it seems to be working. I'm wondering if this is a case in which keepalive simply isn't supported. Thanks, Chance

    Replies
    1. I haven't done Linux socket programming since the late '90s, but it should certainly support keepalives.

      The default timeout is 2 hours, just like Windows. Windows permits per-connection overrides of this timeout (by sending a SIO_KEEPALIVE_VALS IOCTL). I believe the Linux equivalents are TCP_KEEPIDLE, TCP_KEEPCNT, and TCP_KEEPINTVL (sent as IPPROTO_TCP socket options).

  16. Hi Stephen,

    What is your take on UDP sockets? Even when connected, UDP sockets can seem to behave like half-open and half-closed sockets at both ends.

    So how do you know whether the other side is still alive and the connection is intact?

    Replies
    1. UDP is at a much lower level than TCP. Technically, there is no actual "connection"; when the API docs talk about a "connection" all that really means is that you're setting local default values for future operations.

      So it really doesn't make sense to check for a "connection". However, you can send a keepalive to the other side to see if it's still there, just like TCP. But if the "keepalive" doesn't make it, there is no built-in NAK or notification that it failed. And UDP has no built-in retries, so if a keepalive fails it doesn't *necessarily* mean that the other side isn't there.

  17. Hi Stephen, Thanks for the article.

    If I detect something like this on my Cisco 3945 router, does that indicate an attack, and how would one resolve it?

    Mar 24 04:34:50.428 UTC: %FW-4-HOST_TCP_ALERT_ON: Max tcp half-open connections (1000) exceeded for host x.x.x.x (Public IP)
    Mar 24 04:35:12.132 UTC: %FW-4-HOST_TCP_ALERT_ON: Max tcp half-open connections (1000) exceeded for host x.x.x.x (Public IP)
    Mar 24 04:35:15.356 UTC: %FW-4-ALERT_ON: getting aggressive, count (1017/3600) current 1-min rate: 2401
    Mar 24 04:41:33.802 UTC: %FW-4-ALERT_OFF: calming down, count (23/2900) current 1-min rate: 1536
    Mar 24 04:49:51.957 UTC: %FW-4-ALERT_ON: getting aggressive, count (27/3600) current 1-min rate: 2401
    Mar 24 04:50:17.394 UTC: %FW-4-ALERT_OFF: calming down, count (11/2900) current 1-min rate: 1410

    Thanks

    Replies
    1. That's not really my area of expertise, but it does look like it could be an attack. I recommend shutting down all incoming access from the Internet completely, including ping responses.
