Dear socks5 community!

It seems that the socks5 reference implementation, at least from version 1.0r10 up to and including the current version 1.0r11, shows undesired behaviour in threaded mode in heavily loaded environments. With version 1.0r11 an additional problem regarding signal handling can be observed.

We use Solaris 2.6 with socks5 1.0r10, now switched to 1.0r11, running in threaded mode. The load is around 178000 connections per day.

Description and Analysis Bug 1
==============================

Typical symptom:

Nov  3 18:06:30 dax Socks5[24817]: 000001: server exiting: fork failed
Nov  3 18:06:31 dax Socks5[24817]: 000001: Socks5 Exiting at: Fri Nov 03 18:06:30 2000
Nov  3 18:07:01 dax Socks5[18725]: 000001: Socks5 starting at Fri Nov 03 18:07:01 2000 in threading mode

How long it takes until the socks5 server terminates like this depends on the incoming connection rate; it varies from one hour to a day. The logfile states a fork problem, which is definitely NOT the origin of this behaviour! Intensive observation with selective debugging output (the debug levels built into the implementation are nice but not helpful here, because one gets too much information, which breaks down the syslog daemon and the server performance even if logging goes directly to a file) uncovered the following chain of events.

The server is started in threaded mode with 16 server processes and a thread limit of 64 threads per process (actually 63 worker threads plus one acceptor/creator thread). I am referring to the source code in server/socket.c and, later on, to include/sigfix.h as well.

A short introduction to how the threaded mode works: There is a master process which forks slave processes on demand (at most nservers = NUMCLIENTS/4, default 16). Each slave process may create nthreads threads (the minimum of MAXOPENS/4 and MAXTHREADS-1, i.e. 63). Only one process, the one started last, takes the role of the "acceptor". This process is the only one which creates threads, and it signals the master process to create a new slave process when all threads in the acceptor process are busy. All other (non-acceptor) slave processes and their threads remain active; when all threads of a slave have exited, the slave process terminates too.

A worker thread terminates right after it completes its connection handling, or when the wait in the connection accept exceeds 5 minutes. The acceptor slave process has a slightly different structure and behaviour regarding its threads: its initial thread is the so-called "main thread", which does the acceptor work (creating threads, signaling the master for a new process). As long as the main thread is active, all threads proceed to accept new connections without terminating immediately after their work is done; only the accept timeout still terminates a thread. A compilable toy model of this lifecycle follows below.

In this scenario some problems may arise. The number of slave processes and the threads they contain is limited (a maximum of 16x64 concurrent connections). The above strategy is not bad if one assumes that connection durations are (very) short. This is not always true, especially in environments with higher connection rates. The worst case is the situation where in each non-acceptor slave process all threads have already terminated except one, because its connection has a permanent character (an IRC or any other login session). In this case all non-acceptor slave processes carry exactly one connection each (15 in total), and only the acceptor slave process has the full capacity of 64 threads, which gives a maximum of 15 + 64 = 79 concurrent connections (compared to the theoretical 1024).
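To make this lifecycle easier to follow, here is a compilable toy model of it. This is my own sketch, not code from server/socket.c; accept_with_timeout() and handle_connection() are placeholders for the real accept wrapper and proxy work:

----------------------------------------------------------------------------
/* Toy model of the worker-thread lifecycle (illustrative only). */
#include <pthread.h>

static pthread_mutex_t accept_mutex = PTHREAD_MUTEX_INITIALIZER;
static int iamacceptor = 0;          /* nonzero in the acceptor slave */

/* Placeholders for the real accept/proxy work. */
static int  accept_with_timeout(void) { return -1; } /* <0: 5 min timeout */
static void handle_connection(int fd) { (void)fd; }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int fd;

        pthread_mutex_lock(&accept_mutex);   /* only one thread accepts */
        fd = accept_with_timeout();
        pthread_mutex_unlock(&accept_mutex);

        if (fd < 0)
            return 0;        /* accept timeout: thread always exits      */
        handle_connection(fd);
        if (!iamacceptor)
            return 0;        /* non-acceptor slave: one job, then exit   */
        /* acceptor slave: loop and accept the next connection           */
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, 0, worker, 0);
    pthread_join(t, 0);
    return 0;
}
----------------------------------------------------------------------------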
At this point the termination bug comes into play. The maximum number of forked processes has been reached, and the following chain reaction happens. I will now take a closer look at server/socket.c:

1. Context 1 (MASTER): The master process is in routine GetSignals(), where it waits all the time for signal events (SigPause()):

----------------------------------------------------------------------------
    /* Wait for any signal to arrive, esp SIGCHLD and SIGHUP.  SIGHUP  */
    /* will cause a re-read of the config file, everything else: loop. */
    if (!hadfatalsig && !hadsigint && !hadresetsig &&
        *infd != S5InvalidIOHandle) {
        SigPause();
    }
----------------------------------------------------------------------------

2. Context 2 (SLAVE): The main thread in the acceptor process is in routine GetNetConnection(), near the end, after creating a new thread (assuming the thread maximum has been reached). A check for reaching the maximum thread count leads to a signaling action (SIGUSR1) addressed to the master process, asking it to create a new slave process (acting as the new acceptor), and terminates the current acceptor (the main thread exits). This switches back to context 1, the master process:

----------------------------------------------------------------------------
    nconns++;
    nservers++;

    if (nconns >= nservers && nservers >= nthreads) {
        hadresetsig = 1;
        MUTEX_UNLOCK(conn_mutex);
        kill(getppid(), SIGUSR1);
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Accept: Thread(%d) is full", nservers);
        THREAD_EXIT(0);
    } else {
        MUTEX_UNLOCK(conn_mutex);
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Accept: New thread(%d) created", nservers);
    }
----------------------------------------------------------------------------

3. MASTER: SIGUSR1 is caught and sets "hadresetsig" to 1. SigPause() returns, and execution reaches:

----------------------------------------------------------------------------
    if (hadresetsig && servermode == THREADED) acceptor = 0;

    hadfatalsig = 0;
    hadresetsig = 0;
----------------------------------------------------------------------------

Resetting "acceptor" to 0 lets the for-loop continue, and acceptor == 0 triggers the DoFork() call. acceptor receives the return value of DoFork(), which fails because the slave process maximum (usually 16) has been reached. Therefore acceptor is set to -1:

----------------------------------------------------------------------------
    /* Do our thing if everything is ok... */
    if (*infd != S5InvalidIOHandle) {
        switch (servermode) {
        case THREADED:
            if (acceptor == 0) acceptor = DoFork();
            if (iamachild) goto done;
            break;
        case PREFORKING:
            while (DoFork() > 0);
            if (iamachild) goto done;
            break;
        }
    }

    if (servermode == THREADED && (acceptor < 0 && errno != EAGAIN)) {
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_ERROR, 0,
                    "server exiting: fork failed");
        hadsigint = 1;
    }
----------------------------------------------------------------------------

Since acceptor is -1 and errno has been set to EAGAIN (by DoFork()), no exit happens at this point (hadresetsig has been reset together with acceptor, see above). After that the master reaches the SigPause() call again and waits for a child to exit before it restarts a new acceptor slave process -- which now does not exist, so no new connections are accepted, even though enough thread slots could be used if they were kept active. As soon as a child terminates, SigPause() returns, and the master recognises that a child has terminated and starts over in the for-loop:

----------------------------------------------------------------------------
    if (!hadfatalsig && !hadresetsig) {    /* SIGCHLD received... */
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_DEBUG(15), 0,
                    "Parent reaped? (%d child%s)", nchildren,
                    (nchildren != 1)?"ren":"");
        continue;
    }
----------------------------------------------------------------------------

But this time acceptor still contains the value -1 from the failed DoFork(), and errno has changed to EINTR because the SigPause() call was interrupted. DoFork() is NOT called again, because acceptor is still -1 right at the beginning of the for-loop. Now the time to exit has come: acceptor is negative from the failed DoFork() and errno has the value EINTR from the SigPause() call, so the following code leads to an exit with the "fork failed" message:

----------------------------------------------------------------------------
    if (servermode == THREADED && (acceptor < 0 && errno != EAGAIN)) {
        S5LogUpdate(S5LogDefaultHandle, S5_LOG_ERROR, 0,
                    "server exiting: fork failed");
        hadsigint = 1;
    }
----------------------------------------------------------------------------
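To isolate the mechanism, here is a minimal standalone program (my own construction, not socks5 code) modelling the two passes through the master's for-loop: the stale acceptor value survives into the next iteration while errno is silently overwritten by the interrupted SigPause():

----------------------------------------------------------------------------
#include <errno.h>
#include <stdio.h>

/* Stands in for DoFork() once the slave maximum has been reached. */
static int failing_fork(void)
{
    errno = EAGAIN;
    return -1;
}

int main(void)
{
    int acceptor = 0;
    int pass;

    for (pass = 1; pass <= 2; pass++) {
        if (acceptor == 0)
            acceptor = failing_fork();  /* pass 1: -1, errno == EAGAIN */

        if (acceptor < 0 && errno != EAGAIN) {
            /* pass 2 takes this branch although fork was never retried */
            printf("pass %d: server exiting: fork failed\n", pass);
            return 1;
        }

        /* SigPause() is interrupted by SIGCHLD and leaves EINTR in */
        /* errno; acceptor is never reset back to 0.                */
        errno = EINTR;
    }
    return 0;
}
----------------------------------------------------------------------------

The program prints "pass 2: server exiting: fork failed", which is exactly the behaviour described above. Resetting acceptor to 0 after the check (see the bugfix below) makes the second pass retry DoFork() instead of taking the exit path.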
Bugfix 1
========

It should be clear that acceptor must be set back to 0 so that DoFork() is called again. This can be done right after the exit check, e.g. with:

    if (acceptor < 0) acceptor = 0;

That is all that is needed to prevent the unexpected termination of the whole socks5 server.

Extension to bugfix 1
=====================

Even with the above bugfix the socks5 server does not perform very well. The root lies in the behaviour of a slave process after it gives up the acceptor role. In the current implementation each worker thread terminates immediately after it has done its proxy job. But some proxy connections stay up for a long time (e.g. an IRC session), so a slave process may end up containing only one thread. It is just a question of time until the acceptor reaches the process limit (default 16), with the slaves each handling a few long-lived connections, which blocks or delays the creation of a new acceptor process. In this exceptional phase no new connection can be handled, so the proxy server looks pretty dead.

A suggestion which has been tested successfully in our production environment for several weeks: the worker function DoThreadWork() in server/socket.c can be modified as follows. A thread terminates ONLY if no open connection exists in its process. This is only the case when the system has a low connection rate (usually deep in the night, towards morning). Otherwise each thread remains active and continues to accept connections. Because the accept invocation is protected mutually exclusively by a system-wide semaphore variable, hundreds of threads may accept in parallel. As a side effect, the permanent overhead of thread and process creation is minimized, too.

The main part of this idea is (in several places):

----------------------------------------------------------------------------
    if (nconns == 0 && hadresetsig) {
        /* exit thread only if no open connections exist */
----------------------------------------------------------------------------

together with a change to the accept wrapper handling to prevent unconditional thread termination after a timeout: after a timeout the accept is simply re-entered, and only a real accept error terminates the worker thread. A sketch of the resulting worker loop follows below.
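Put together, the modified worker loop would look roughly like this. This is a sketch only: nconns, hadresetsig, conn_mutex, MUTEX_UNLOCK and THREAD_EXIT follow the excerpts quoted above, while MUTEX_LOCK, AcceptWithTimeout(), ACCEPT_TIMEOUT and DoProxyWork() are placeholders of mine for the locking counterpart, the accept wrapper and the proxy work:

----------------------------------------------------------------------------
    for (;;) {
        int fd;

        MUTEX_LOCK(conn_mutex);
        if (nconns == 0 && hadresetsig) {
            /* exit thread only if no open connections exist */
            MUTEX_UNLOCK(conn_mutex);
            THREAD_EXIT(0);
        }
        MUTEX_UNLOCK(conn_mutex);

        fd = AcceptWithTimeout();    /* placeholder for the accept wrapper */
        if (fd == ACCEPT_TIMEOUT)
            continue;                /* timeout: re-enter the accept       */
        if (fd < 0)
            THREAD_EXIT(0);          /* only a real accept error exits     */

        DoProxyWork(fd);             /* handle the connection, then loop   */
    }
----------------------------------------------------------------------------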
Description and Analysis Bug 2
==============================

File referred to: include/sigfix.h

The following scenario arises: The master process starts the first slave (which is also the initial acceptor process). After some time all workers of the acceptor process are in use, and a SIGUSR1 is used to notify the master to create a new acceptor process. This first time the signal handler catches the signal and the master process creates a new slave process as expected. But when the second acceptor process also runs out of threads and the master is signaled again to create a new acceptor process, the signal terminates the master process. Now there is no new acceptor process (no new connection can be established) and no master process to handle the creation of new processes! The ps command output typically shows two socks5 processes handling the remaining connections.

Analysis: Between r10 and r11, sigfix.h has been changed. It is not clear to me what the motivation for these changes was. The wrapper procedure Signal() uses the POSIX system call sigaction() to set up a signal handler. In release 11 the flags SA_RESETHAND and SA_NODEFER have been added. This is fatal for the master-slave process communication:

SA_RESETHAND: the signal disposition is reset to the default once a signal has been caught, i.e. the handler is not reinstalled!
SA_NODEFER: the caught signal is NOT blocked while the handler executes!

With these flags we get the unreliable signal behaviour of old Unix systems. Critical is SA_RESETHAND, which detaches the signal handler from the signal after the handler has been called. In former days a signal handler had to reattach itself by calling signal() again inside the handler, but the SOCKS5 server is not implemented to do this. Therefore a second signal leads to the default behaviour, normally the termination of the process. The flag SA_RESTART has been added too, but it does not seem to have side effects.

Bugfix 2
========

Remove SA_RESETHAND and SA_NODEFER from Signal() in sigfix.h.

I would like to ask if my changes could be incorporated into a next release.
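PS: For illustration, a Signal() wrapper without the two problematic flags would look roughly like this. This is my sketch in the style of the well-known sigaction()-based signal() replacement, not the literal sigfix.h code:

----------------------------------------------------------------------------
#include <signal.h>

typedef void (*SigHandler)(int);

SigHandler Signal(int signo, SigHandler func)
{
    struct sigaction act, oact;

    act.sa_handler = func;
    sigemptyset(&act.sa_mask);  /* without SA_NODEFER, signo itself is   */
                                /* blocked while the handler runs        */
    act.sa_flags = SA_RESTART;  /* SA_RESTART alone seems harmless; no   */
                                /* SA_RESETHAND: the handler stays bound */

    if (sigaction(signo, &act, &oact) < 0)
        return SIG_ERR;
    return oact.sa_handler;
}
----------------------------------------------------------------------------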