<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/linux/sunrpc/svc.h, branch v2.6.31</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>nfs41: sunrpc: add a struct svc_xprt pointer to struct svc_serv for backchannel use</title>
<updated>2009-06-17T21:11:31+00:00</updated>
<author>
<name>Andy Adamson</name>
<email>andros@netapp.com</email>
</author>
<published>2009-04-01T13:23:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=9c9f3f5fa62cc4959e4d4d1cf1ec74f2d6ac1197'/>
<id>9c9f3f5fa62cc4959e4d4d1cf1ec74f2d6ac1197</id>
<content type='text'>
This svc_xprt is passed on to the callback service thread to be later used
to process incoming svc_rqst's

Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This svc_xprt is passed on to the callback service thread to be later used
to process incoming svc_rqst's

Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs41: Backchannel bc_svc_process()</title>
<updated>2009-06-17T21:11:29+00:00</updated>
<author>
<name>Ricardo Labiaga</name>
<email>Ricardo.Labiaga@netapp.com</email>
</author>
<published>2009-04-01T13:23:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4d6bbb6233c9cf23822a2f66f8470c9f40854b77'/>
<id>4d6bbb6233c9cf23822a2f66f8470c9f40854b77</id>
<content type='text'>
Implement the NFSv4.1 backchannel service.  Invokes the common callback
processing logic svc_process_common() to authenticate the call and
dispatch the appropriate NFSv4.1 XDR decoder and operation procedure.
It then invokes bc_send() to send the reply over the same connection.
bc_send() is implemented in a separate patch.

At this time there is no slot validation or reply cache handling.

[nfs41: Preallocate rpc_rqst receive buffer for handling callbacks]
Signed-off-by: Ricardo Labiaga &lt;Ricardo.Labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[Move bc_svc_process() declaration to correct patch]
Signed-off-by: Ricardo Labiaga &lt;Ricardo.Labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Implement the NFSv4.1 backchannel service.  Invokes the common callback
processing logic svc_process_common() to authenticate the call and
dispatch the appropriate NFSv4.1 XDR decoder and operation procedure.
It then invokes bc_send() to send the reply over the same connection.
bc_send() is implemented in a separate patch.

At this time there is no slot validation or reply cache handling.

[nfs41: Preallocate rpc_rqst receive buffer for handling callbacks]
Signed-off-by: Ricardo Labiaga &lt;Ricardo.Labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[Move bc_svc_process() declaration to correct patch]
Signed-off-by: Ricardo Labiaga &lt;Ricardo.Labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfs41: client callback structures</title>
<updated>2009-06-17T20:06:13+00:00</updated>
<author>
<name>Ricardo Labiaga</name>
<email>Ricardo.Labiaga@netapp.com</email>
</author>
<published>2009-04-01T13:22:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=56632b5bff5af10eb12d7e9499b5ffcadcb7a7b2'/>
<id>56632b5bff5af10eb12d7e9499b5ffcadcb7a7b2</id>
<content type='text'>
Adds a new list of rpc_xprt structures, and a readers/writers lock to
protect the list.  The list is used to preallocate resources for
the backchannel during backchannel requests.  Callbacks are not
expected to cause significant latency, so only one callback will
be allowed at this time.

It also adds a pointer to the NFS callback service so that
requests can be directed to it for processing.

New callback members are added to svc_serv. The NFSv4.1 callback service will
sleep on the svc_serv-&gt;svc_cb_waitq until new callback requests arrive.
The request will be queued in svc_serv-&gt;svc_cb_list. This patch adds this
list, the sleep queue and spinlock to svc_serv.

[nfs41: NFSv4.1 callback support]
Signed-off-by: Ricardo Labiaga &lt;ricardo.labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Adds a new list of rpc_xprt structures, and a readers/writers lock to
protect the list.  The list is used to preallocate resources for
the backchannel during backchannel requests.  Callbacks are not
expected to cause significant latency, so only one callback will
be allowed at this time.

It also adds a pointer to the NFS callback service so that
requests can be directed to it for processing.

New callback members are added to svc_serv. The NFSv4.1 callback service will
sleep on the svc_serv-&gt;svc_cb_waitq until new callback requests arrive.
The request will be queued in svc_serv-&gt;svc_cb_list. This patch adds this
list, the sleep queue and spinlock to svc_serv.

[nfs41: NFSv4.1 callback support]
Signed-off-by: Ricardo Labiaga &lt;ricardo.labiaga@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux</title>
<updated>2009-04-06T20:25:56+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2009-04-06T20:25:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a63856252d2112e7c452696037a86ceb12f47f80'/>
<id>a63856252d2112e7c452696037a86ceb12f47f80</id>
<content type='text'>
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits)
  nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4
  nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
  nfsd41: Documentation/filesystems/nfs41-server.txt
  nfsd41: CREATE_EXCLUSIVE4_1
  nfsd41: SUPPATTR_EXCLCREAT attribute
  nfsd41: support for 3-word long attribute bitmask
  nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
  nfsd41: pass writable attrs mask to nfsd4_decode_fattr
  nfsd41: provide support for minor version 1 at rpc level
  nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
  nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
  nfsd41: access_valid
  nfsd41: clientid handling
  nfsd41: check encode size for sessions maxresponse cached
  nfsd41: stateid handling
  nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
  nfsd41: destroy_session operation
  nfsd41: non-page DRC for solo sequence responses
  nfsd41: Add a create session replay cache
  nfsd41: create_session operation
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits)
  nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4
  nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
  nfsd41: Documentation/filesystems/nfs41-server.txt
  nfsd41: CREATE_EXCLUSIVE4_1
  nfsd41: SUPPATTR_EXCLCREAT attribute
  nfsd41: support for 3-word long attribute bitmask
  nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
  nfsd41: pass writable attrs mask to nfsd4_decode_fattr
  nfsd41: provide support for minor version 1 at rpc level
  nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
  nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
  nfsd41: access_valid
  nfsd41: clientid handling
  nfsd41: check encode size for sessions maxresponse cached
  nfsd41: stateid handling
  nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
  nfsd41: destroy_session operation
  nfsd41: non-page DRC for solo sequence responses
  nfsd41: Add a create session replay cache
  nfsd41: create_session operation
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>nfsd41: hard page limit for DRC</title>
<updated>2009-04-04T00:41:17+00:00</updated>
<author>
<name>Andy Adamson</name>
<email>andros@netapp.com</email>
</author>
<published>2009-04-03T05:28:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=c3d06f9ce8544fecfe13e377d1e2c2e47fe18dbc'/>
<id>c3d06f9ce8544fecfe13e377d1e2c2e47fe18dbc</id>
<content type='text'>
Use no more than 1/128th of the number of free pages at nfsd startup for the
v4.1 DRC.

This is an arbitrary default which should probably end up under the control
of an administrator.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
[moved added fields in struct svc_serv under CONFIG_NFSD_V4_1]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[fix set_max_drc calculation of sv_drc_max_pages]
[moved NFSD_DRC_SIZE_SHIFT's declaration up in header file]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use no more than 1/128th of the number of free pages at nfsd startup for the
v4.1 DRC.

This is an arbitrary default which should probably end up under the control
of an administrator.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
[moved added fields in struct svc_serv under CONFIG_NFSD_V4_1]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[fix set_max_drc calculation of sv_drc_max_pages]
[moved NFSD_DRC_SIZE_SHIFT's declaration up in header file]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>nfsd: don't use the deferral service, return NFS4ERR_DELAY</title>
<updated>2009-04-04T00:41:12+00:00</updated>
<author>
<name>Andy Adamson</name>
<email>andros@netapp.com</email>
</author>
<published>2009-04-03T05:27:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2f425878b6a71571341dcd3f9e9d1a6f6355da9c'/>
<id>2f425878b6a71571341dcd3f9e9d1a6f6355da9c</id>
<content type='text'>
On an NFSv4.1 server cache miss that causes an upcall, NFS4ERR_DELAY will be
returned. It is up to the NFSv4.1 client to resend only the operations that
have not been processed.

Initialize rq_usedeferral to 1 in svc_process(). It will be turned off in
nfsd4_proc_compound() only when NFSv4.1 Sessions are used.

Note: this isn't an adequate solution on its own. It's acceptable as a way
to get some minimal 4.1 up and working, but we're going to have to find a
way to avoid returning DELAY in all common cases before 4.1 can really be
considered ready.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[nfsd41: reverse rq_nodeferral negative logic]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[sunrpc: initialize rq_usedeferral]
Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
On an NFSv4.1 server cache miss that causes an upcall, NFS4ERR_DELAY will be
returned. It is up to the NFSv4.1 client to resend only the operations that
have not been processed.

Initialize rq_usedeferral to 1 in svc_process(). It will be turned off in
nfsd4_proc_compound() only when NFSv4.1 Sessions are used.

Note: this isn't an adequate solution on its own. It's acceptable as a way
to get some minimal 4.1 up and working, but we're going to have to find a
way to avoid returning DELAY in all common cases before 4.1 can really be
considered ready.

Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[nfsd41: reverse rq_nodeferral negative logic]
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
[sunrpc: initialize rq_usedeferral]
Signed-off-by: Andy Adamson &lt;andros@netapp.com&gt;
Signed-off-by: Benny Halevy &lt;bhalevy@panasas.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>SUNRPC: Remove @family argument from svc_create() and svc_create_pooled()</title>
<updated>2009-03-28T19:54:48+00:00</updated>
<author>
<name>Chuck Lever</name>
<email>chuck.lever@oracle.com</email>
</author>
<published>2009-03-19T00:46:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=49a9072f29a1039f142ec98b44a72d7173651c02'/>
<id>49a9072f29a1039f142ec98b44a72d7173651c02</id>
<content type='text'>
Since an RPC service listener's protocol family is specified now via
svc_create_xprt(), it no longer needs to be passed to svc_create() or
svc_create_pooled().  Remove that argument from the synopsis of those
functions, and remove the sv_family field from the svc_serv struct.

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since an RPC service listener's protocol family is specified now via
svc_create_xprt(), it no longer needs to be passed to svc_create() or
svc_create_pooled().  Remove that argument from the synopsis of those
functions, and remove the sv_family field from the svc_serv struct.

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>SUNRPC: Pass a family argument to svc_register()</title>
<updated>2009-03-28T19:54:12+00:00</updated>
<author>
<name>Chuck Lever</name>
<email>chuck.lever@oracle.com</email>
</author>
<published>2009-03-19T00:46:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4b62e58cccff9c5e7ffc7023f7ec24c75fbd549b'/>
<id>4b62e58cccff9c5e7ffc7023f7ec24c75fbd549b</id>
<content type='text'>
The sv_family field is going away.  Instead of using sv_family, have
the svc_register() function take a protocol family argument.

Since this argument represents a protocol family, and not an address
family, this argument takes an int, as this is what is passed to
sock_create_kern().  Also make sure svc_register's helpers are
checking for PF_FOO instead of AF_FOO.  The values of [AP]F_FOO are
equivalent; this is simply a symbolic change to reflect the semantics
of the value stored in that variable.

sock_create_kern() should return EPFNOSUPPORT if the passed-in
protocol family isn't supported, but it uses EAFNOSUPPORT for this
case.  We will stick with that tradition here, as svc_register()
is called by the RPC server in the same path as sock_create_kern().

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The sv_family field is going away.  Instead of using sv_family, have
the svc_register() function take a protocol family argument.

Since this argument represents a protocol family, and not an address
family, this argument takes an int, as this is what is passed to
sock_create_kern().  Also make sure svc_register's helpers are
checking for PF_FOO instead of AF_FOO.  The values of [AP]F_FOO are
equivalent; this is simply a symbolic change to reflect the semantics
of the value stored in that variable.

sock_create_kern() should return EPFNOSUPPORT if the passed-in
protocol family isn't supported, but it uses EAFNOSUPPORT for this
case.  We will stick with that tradition here, as svc_register()
is called by the RPC server in the same path as sock_create_kern().

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: Trond Myklebust &lt;Trond.Myklebust@netapp.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>knfsd: add file to export stats about nfsd pools</title>
<updated>2009-03-18T21:38:42+00:00</updated>
<author>
<name>Greg Banks</name>
<email>gnb@sgi.com</email>
</author>
<published>2009-01-13T10:26:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=03cf6c9f49a8fea953d38648d016e3f46e814991'/>
<id>03cf6c9f49a8fea953d38648d016e3f46e814991</id>
<content type='text'>
Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.

This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:

http://article.gmane.org/gmane.linux.nfs/10375

It has also been updated thus:

 * moved EXPORT_SYMBOL() to near the function it exports
 * made the new struct seq_operations const
 * used SEQ_START_TOKEN instead of ((void *)1)
 * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
   printk in svc_pool_stats_*()" by Harshula Jayasuriya.
 * merged fix from SGI PV 964001 "Crash reading pool_stats before
   nfsds are started".

Signed-off-by: Greg Banks &lt;gnb@sgi.com&gt;
Signed-off-by: Harshula Jayasuriya &lt;harshula@sgi.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.

This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:

http://article.gmane.org/gmane.linux.nfs/10375

It has also been updated thus:

 * moved EXPORT_SYMBOL() to near the function it exports
 * made the new struct seq_operations const
 * used SEQ_START_TOKEN instead of ((void *)1)
 * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
   printk in svc_pool_stats_*()" by Harshula Jayasuriya.
 * merged fix from SGI PV 964001 "Crash reading pool_stats before
   nfsds are started".

Signed-off-by: Greg Banks &lt;gnb@sgi.com&gt;
Signed-off-by: Harshula Jayasuriya &lt;harshula@sgi.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>knfsd: avoid overloading the CPU scheduler with enormous load averages</title>
<updated>2009-03-18T21:38:41+00:00</updated>
<author>
<name>Greg Banks</name>
<email>gnb@sgi.com</email>
</author>
<published>2009-01-13T10:26:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=59a252ff8c0f2fa32c896f69d56ae33e641ce7ad'/>
<id>59a252ff8c0f2fa32c896f69d56ae33e641ce7ad</id>
<content type='text'>
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem time out at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel; modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fit in the server's RAM, so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks &lt;gnb@sgi.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem time out at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel; modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fit in the server's RAM, so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks &lt;gnb@sgi.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@citi.umich.edu&gt;
</pre>
</div>
</content>
</entry>
</feed>
