Dear socks5 community!

It seems that the socks5 reference implementation, at least from version 1.0r10 up to and including the current version 1.0r11, shows undesired behaviour in threaded mode in heavily loaded environments. With version 1.0r11 an additional problem regarding signal handling can be observed.

We use Solaris 2.6 with socks5 1.0r10, now switched to 1.0r11, running in threaded mode. The load is around 178000 connections per day.

Description and Analysis Bug 1
==============================

Typical symptom:

Nov  3 18:06:30 dax Socks5[24817]: 000001: server exiting: fork failed
Nov  3 18:06:31 dax Socks5[24817]: 000001: Socks5 Exiting at: Fri Nov 03 18:06:30 2000
Nov  3 18:07:01 dax Socks5[18725]: 000001: Socks5 starting at Fri Nov 03 18:07:01 2000 in threading mode

How long it takes until the socks5 server terminates like this depends on the incoming connection rate; it varies from one hour to a day. The logfile states a fork problem, which is definitely NOT the origin of this behaviour! Intensive observation with selective debugging output (the debug levels built into the implementation are nice but not helpful here, because one gets too much information, which breaks down the syslog daemon and the server performance even if logging goes directly to a file) uncovered the following chain of events.

The server is started in threaded mode with 16 server processes and a thread limit of 64 threads per process (actually 63 worker threads plus one acceptor/creator thread). I am referring to the source code in server/socket.c and, later on, to include/sigfix.h as well.

A short introduction to how the threaded mode works: There is a master process which forks slave processes on demand (at most nservers = NUMCLIENTS/4, default 16). Each slave process may create nthreads threads (the minimum of MAXOPENS/4 and MAXTHREADS-1, i.e. 63). Only one process, the one started last, takes the role of the "acceptor". This process is the only one which creates threads, and it signals the master process to create a new slave process when all threads in the acceptor process are busy. All other (non-acceptor) slave processes and their threads remain active; when all threads of a slave have exited, the slave process terminates too.

A worker thread terminates right after it completes its connection handling, or when the wait in the connection accept exceeds 5 minutes. The acceptor slave process has a slightly different structure and behaviour regarding its threads: its initial thread is the so-called "main thread", which does the acceptor work (creating threads, signaling the master for a new process). As long as the main thread is active, all threads proceed to accept new connections without terminating immediately after their work is done; only the accept timeout still terminates a thread. A compilable toy model of this lifecycle follows below.

In this scenario some problems may arise. The number of slave processes and the threads they contain is limited (a maximum of 16x64 concurrent connections). The above strategy is not bad if one assumes that connection durations are (very) short. This is not always true, especially in environments with higher connection rates. The worst case is the situation where in each non-acceptor slave process all threads have already terminated except one, because its connection has a permanent character (an IRC or any other login session). In this case all non-acceptor slave processes carry exactly one connection each (15 in total), and only the acceptor slave process has the full capacity of 64 threads, which gives a maximum of 15 + 64 = 79 concurrent connections (compared to the theoretical 1024).
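To make this lifecycle easier to follow, here is a compilable toy model of it. This is my own sketch, not code from server/socket.c; accept_with_timeout() and handle_connection() are placeholders for the real accept wrapper and proxy work:

----------------------------------------------------------------------------
/* Toy model of the worker-thread lifecycle (illustrative only). */
#include <pthread.h>

static pthread_mutex_t accept_mutex = PTHREAD_MUTEX_INITIALIZER;
static int iamacceptor = 0;          /* nonzero in the acceptor slave */

/* Placeholders for the real accept/proxy work. */
static int  accept_with_timeout(void) { return -1; } /* <0: 5 min timeout */
static void handle_connection(int fd) { (void)fd; }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int fd;

        pthread_mutex_lock(&accept_mutex);   /* only one thread accepts */
        fd = accept_with_timeout();
        pthread_mutex_unlock(&accept_mutex);

        if (fd < 0)
            return 0;        /* accept timeout: thread always exits      */
        handle_connection(fd);
        if (!iamacceptor)
            return 0;        /* non-acceptor slave: one job, then exit   */
        /* acceptor slave: loop and accept the next connection           */
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, 0, worker, 0);
    pthread_join(t, 0);
    return 0;
}
----------------------------------------------------------------------------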
At this point the termination bug comes into play. The maximum number of forked processes has been reached, and the following chain reaction happens. I will now take a closer look at server/socket.c:

1. Context 1 (MASTER): The master process is in routine GetSignals(), where it waits all the time for signal events (SigPause()):

----------------------------------------------------------------------------
    /* Wait for any signal to arrive, esp SIGCHLD and SIGHUP.  SIGHUP  */
    /* will cause a re-read of the config file, everything else: loop. */
    if (!hadfatalsig && !hadsigint && !hadresetsig &&
        *infd != S5InvalidIOHandle) {
        SigPause();
    }
----------------------------------------------------------------------------

2. Context 2 (SLAVE): The main thread in the acceptor process is in routine GetNetConnection(), near the end, after creating a new thread (assuming the thread maximum has been reached). A check for reaching the maximum thread count leads to a signaling action (SIGUSR1) addressed to the master process, asking it to create a new slave process (acting as the new acceptor), and terminates the current acceptor (the main thread exits). This switches back to context 1, the master process:

----------------------------------------------------------------------------
    nconns++;
    nservers++;

    if (nconns >= nservers && nservers >= nthreads) {
        hadresetsig = 1;
        MUTEX_UNLOCK(conn_mutex);
        kill(getppid(), SIGUSR1);
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Accept: Thread(%d) is full", nservers);
        THREAD_EXIT(0);
    } else {
        MUTEX_UNLOCK(conn_mutex);
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Accept: New thread(%d) created", nservers);
    }
----------------------------------------------------------------------------

3. MASTER: SIGUSR1 is caught and sets "hadresetsig" to 1. SigPause() returns, and execution reaches:

----------------------------------------------------------------------------
    if (hadresetsig && servermode == THREADED) acceptor = 0;

    hadfatalsig = 0;
    hadresetsig = 0;
----------------------------------------------------------------------------

Resetting "acceptor" to 0 lets the for-loop continue, and acceptor == 0 triggers the DoFork() call. acceptor receives the return value of DoFork(), which fails because the slave process maximum (usually 16) has been reached. Therefore acceptor is set to -1:

----------------------------------------------------------------------------
    /* Do our thing if everything is ok... */
    if (*infd != S5InvalidIOHandle) {
        switch (servermode) {
        case THREADED:
            if (acceptor == 0) acceptor = DoFork();
            if (iamachild) goto done;
            break;
        case PREFORKING:
            while (DoFork() > 0);
            if (iamachild) goto done;
            break;
        }
    }

    if (servermode == THREADED && (acceptor < 0 && errno != EAGAIN)) {
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_ERROR, 0,
                    "server exiting: fork failed");
        hadsigint = 1;
    }
----------------------------------------------------------------------------

Since acceptor is -1 and errno has been set to EAGAIN (by DoFork()), no exit happens at this point (hadresetsig has been reset together with acceptor, see above). After that the master reaches the SigPause() call again and waits for a child to exit before it restarts a new acceptor slave process -- which now does not exist, so no new connections are accepted, even though enough thread slots could be used if they were kept active. As soon as a child terminates, SigPause() returns, and the master recognises that a child has terminated and starts over in the for-loop:

----------------------------------------------------------------------------
    if (!hadfatalsig && !hadresetsig) {    /* SIGCHLD received... */
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Parent reaped? (%d child%s)", nchildren,
                    (nchildren != 1)?"ren":"");
        continue;
    }
----------------------------------------------------------------------------

But this time acceptor still contains the value -1 from the failed DoFork(), and errno has changed to EINTR because the SigPause() call was interrupted. DoFork() is NOT called again, because acceptor is still -1 right at the beginning of the for-loop. Now the time to exit has come: acceptor is negative from the failed DoFork() and errno has the value EINTR from the SigPause() call, so the following code leads to an exit with the "fork failed" message:

----------------------------------------------------------------------------
    if (servermode == THREADED && (acceptor < 0 && errno != EAGAIN)) {
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_ERROR, 0,
                    "server exiting: fork failed");
        hadsigint = 1;
    }
----------------------------------------------------------------------------
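To isolate the mechanism, here is a minimal standalone program (my own construction, not socks5 code) modelling the two passes through the master's for-loop: the stale acceptor value survives into the next iteration while errno is silently overwritten by the interrupted SigPause():

----------------------------------------------------------------------------
#include <errno.h>
#include <stdio.h>

/* Stands in for DoFork() once the slave maximum has been reached. */
static int failing_fork(void)
{
    errno = EAGAIN;
    return -1;
}

int main(void)
{
    int acceptor = 0;
    int pass;

    for (pass = 1; pass <= 2; pass++) {
        if (acceptor == 0)
            acceptor = failing_fork();  /* pass 1: -1, errno == EAGAIN */

        if (acceptor < 0 && errno != EAGAIN) {
            /* pass 2 takes this branch although fork was never retried */
            printf("pass %d: server exiting: fork failed\n", pass);
            return 1;
        }

        /* SigPause() is interrupted by SIGCHLD and leaves EINTR in */
        /* errno; acceptor is never reset back to 0.                */
        errno = EINTR;
    }
    return 0;
}
----------------------------------------------------------------------------

The program prints "pass 2: server exiting: fork failed", which is exactly the behaviour described above. Resetting acceptor to 0 after the check (see the bugfix below) makes the second pass retry DoFork() instead of taking the exit path.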
Bugfix 1
========

It should be clear that acceptor must be set back to 0 so that DoFork() is called again. This can be done right after the exit check, e.g. with:

    if (acceptor < 0) acceptor = 0;

That is all that is needed to prevent the unexpected termination of the whole socks5 server.

Extension to bugfix 1
=====================

Even with the above bugfix the socks5 server does not perform very well. The root lies in the behaviour of a slave process after it gives up the acceptor role. In the current implementation each worker thread terminates immediately after it has done its proxy job. But some proxy connections stay up for a long time (e.g. an IRC session), so a slave process may end up containing only one thread. It is just a question of time until the acceptor reaches the process limit (default 16), with the slaves each handling a few long-lived connections, which blocks or delays the creation of a new acceptor process. In this exceptional phase no new connection can be handled, so the proxy server looks pretty dead.

A suggestion which has been tested successfully in our production environment for several weeks: the worker function DoThreadWork() in server/socket.c can be modified as follows. A thread terminates ONLY if no open connection exists in its process. This is only the case when the system has a low connection rate (usually deep in the night, towards morning). Otherwise each thread remains active and continues to accept connections. Because the accept invocation is protected mutually exclusively by a system-wide semaphore variable, hundreds of threads may accept in parallel. As a side effect, the permanent overhead of thread and process creation is minimized, too.

The main part of this idea is (in several places):

----------------------------------------------------------------------------
    if (nconns == 0 && hadresetsig) {
        /* exit thread only if no open connections exist */
----------------------------------------------------------------------------

together with a change to the accept wrapper handling to prevent unconditional thread termination after a timeout: after a timeout the accept is simply re-entered, and only a real accept error terminates the worker thread. A sketch of the resulting worker loop follows below.
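Put together, the modified worker loop would look roughly like this. This is a sketch only: nconns, hadresetsig, conn_mutex, MUTEX_UNLOCK and THREAD_EXIT follow the excerpts quoted above, while MUTEX_LOCK, AcceptWithTimeout(), ACCEPT_TIMEOUT and DoProxyWork() are placeholders of mine for the locking counterpart, the accept wrapper and the proxy work:

----------------------------------------------------------------------------
    for (;;) {
        int fd;

        MUTEX_LOCK(conn_mutex);
        if (nconns == 0 && hadresetsig) {
            /* exit thread only if no open connections exist */
            MUTEX_UNLOCK(conn_mutex);
            THREAD_EXIT(0);
        }
        MUTEX_UNLOCK(conn_mutex);

        fd = AcceptWithTimeout();    /* placeholder for the accept wrapper */
        if (fd == ACCEPT_TIMEOUT)
            continue;                /* timeout: re-enter the accept       */
        if (fd < 0)
            THREAD_EXIT(0);          /* only a real accept error exits     */

        DoProxyWork(fd);             /* handle the connection, then loop   */
    }
----------------------------------------------------------------------------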
Description and Analysis Bug 2
==============================

File referred to: include/sigfix.h

The following scenario arises: The master process starts the first slave (which is also the initial acceptor process). After some time all workers of the acceptor process are in use, and a SIGUSR1 is used to notify the master to create a new acceptor process. This first time the signal handler catches the signal and the master process creates a new slave process as expected. But when the second acceptor process also runs out of threads and the master is signaled again to create a new acceptor process, the signal terminates the master process. Now there is no new acceptor process (no new connection can be established) and no master process to handle the creation of new processes! The ps command output typically shows two socks5 processes handling the remaining connections.

Analysis: Between r10 and r11, sigfix.h has been changed. It is not clear to me what the motivation for these changes was. The wrapper procedure Signal() uses the POSIX system call sigaction() to set up a signal handler. In release 11 the flags SA_RESETHAND and SA_NODEFER have been added. This is fatal for the master-slave process communication:

SA_RESETHAND: the signal disposition is reset to the default once a signal has been caught, i.e. the handler is not reinstalled!
SA_NODEFER: the caught signal is NOT blocked while the handler executes!

With these flags we get the unreliable signal behaviour of old Unix systems. Critical is SA_RESETHAND, which detaches the signal handler from the signal after the handler has been called. In former days a signal handler had to reattach itself by calling signal() again inside the handler, but the SOCKS5 server is not implemented to do this. Therefore a second signal leads to the default behaviour, normally the termination of the process. The flag SA_RESTART has been added too, but it does not seem to have side effects.

Bugfix 2
========

Remove SA_RESETHAND and SA_NODEFER from Signal() in sigfix.h.

I would like to ask if my changes could be incorporated into a next release.
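PS: For illustration, a Signal() wrapper without the two problematic flags would look roughly like this. This is my sketch in the style of the well-known sigaction()-based signal() replacement, not the literal sigfix.h code:

----------------------------------------------------------------------------
#include <signal.h>

typedef void (*SigHandler)(int);

SigHandler Signal(int signo, SigHandler func)
{
    struct sigaction act, oact;

    act.sa_handler = func;
    sigemptyset(&act.sa_mask);  /* without SA_NODEFER, signo itself is   */
                                /* blocked while the handler runs        */
    act.sa_flags = SA_RESTART;  /* SA_RESTART alone seems harmless; no   */
                                /* SA_RESETHAND: the handler stays bound */

    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;
    return oact.sa_handler;
}
----------------------------------------------------------------------------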