linux.git/net, branch v3.6-rc4

Merge branch 'for-3.6' of git://linux-nfs.org/~bfields/linux

2012-08-25T18:43:41+00:00

Pull nfsd bugfixes from J. Bruce Fields:
 "Particular thanks to Michael Tokarev, Malahal Naineni, and Jamie
  Heilman for their testing and debugging help."

* 'for-3.6' of git://linux-nfs.org/~bfields/linux:
  svcrpc: fix svc_xprt_enqueue/svc_recv busy-looping
  svcrpc: sends on closed socket should stop immediately
  svcrpc: fix BUG() in svc_tcp_clear_pages
  nfsd4: fix security flavor of NFSv4.0 callback

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

2012-08-22T16:58:05+00:00

Pull ceph fixes from Sage Weil:
 "Jim's fix closes a narrow race introduced with the msgr changes.  One
  fix resolves problems with debugfs initialization that Yan found when
  multiple client instances are created (e.g., two clusters mounted, or
  rbd + cephfs), another one fixes problems with mounting a nonexistent
  server subdirectory, and the last one fixes a divide by zero error
  from unsanitized ioctl input that Dan Carpenter found."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: avoid divide by zero in __validate_layout()
  libceph: avoid truncation due to racing banners
  ceph: tolerate (and warn on) extraneous dentry from mds
  libceph: delay debugfs initialization until we learn global_id

libceph: avoid truncation due to racing banners

2012-08-21T22:55:27+00:00

Because the Ceph client messenger uses a non-blocking connect, it is
possible for the sending of the client banner to race with the
arrival of the banner sent by the peer.

When ceph_sock_state_change() notices the connect has completed, it
schedules work to process the socket via con_work().  During this
time the peer is writing its banner, and arrival of the peer banner
races with con_work().

If con_work() calls try_read() before the peer banner arrives, there
is nothing for it to do, after which con_work() calls try_write() to
send the client's banner.  In this case Ceph's protocol negotiation
can complete successfully.

The server-side messenger immediately sends its banner and addresses
after accepting a connect request, *before* actually attempting to
read or verify the banner from the client.  As a result, it is
possible for the banner from the server to arrive before con_work()
calls try_read().  If that happens, try_read() will read the banner
and prepare protocol negotiation info via prepare_write_connect().
prepare_write_connect() calls con_out_kvec_reset(), which discards
the as-yet-unsent client banner.  Next, con_work() calls
try_write(), which sends the protocol negotiation info rather than
the banner that the peer is expecting.

The result is that the peer sees an invalid banner, and the client
reports "negotiation failed".

Fix this by moving con_out_kvec_reset() out of
prepare_write_connect() to its callers at all locations except the
one where the banner might still need to be sent.
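
The ordering problem can be sketched with a toy model of the outgoing
queue.  This is only an illustration under stated assumptions: every
toy_* name is a hypothetical stand-in, the queue is a plain array
rather than the messenger's kvec array, and only the case where the
peer banner arrives before the first try_write() is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_OUT 4

/* Hypothetical stand-in for a connection's queue of unsent items. */
struct toy_con {
    const char *out[TOY_MAX_OUT];
    int out_count;
};

void toy_out_reset(struct toy_con *con)
{
    con->out_count = 0;
}

void toy_out_add(struct toy_con *con, const char *item)
{
    assert(con->out_count < TOY_MAX_OUT);
    con->out[con->out_count++] = item;
}

/* Old shape: preparing the connect info always reset the queue first. */
void toy_prepare_connect_old(struct toy_con *con)
{
    toy_out_reset(con);
    toy_out_add(con, "connect");
}

/* New shape: no reset here; callers that need one do it themselves. */
void toy_prepare_connect_new(struct toy_con *con)
{
    toy_out_add(con, "connect");
}

/*
 * The peer banner was read before anything was written.  Returns true
 * if the client banner is still the first item that would be sent.
 */
bool toy_banner_sent_first(bool use_fix)
{
    struct toy_con con = { .out_count = 0 };

    toy_out_add(&con, "banner");          /* queued at connect time */
    if (use_fix)
        toy_prepare_connect_new(&con);    /* banner stays queued */
    else
        toy_prepare_connect_old(&con);    /* banner is discarded */
    return strcmp(con.out[0], "banner") == 0;
}
```

In this model toy_banner_sent_first(false) reports that the banner was
lost, while toy_banner_sent_first(true) reports that it is still at the
head of the queue.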

[elder@inktak.com: added note about server-side behavior]

Signed-off-by: Jim Schutt 
Reviewed-by: Alex Elder

af_netlink: force credentials passing [CVE-2012-3520]

2012-08-21T21:53:01+00:00

Pablo Neira Ayuso discovered that avahi and
potentially NetworkManager accept spoofed Netlink messages because of a
kernel bug.  The kernel passes all-zero SCM_CREDENTIALS ancillary data
to the receiver if the sender did not provide such data, instead of not
including any such data at all or including the correct data from the
peer (as is the case with AF_UNIX).

This bug was introduced in commit 16e572626961
(af_unix: dont send SCM_CREDENTIALS by default).

This patch forces passing credentials for netlink, as
before the regression.

Another fix would be to not add SCM_CREDENTIALS in
netlink messages if not provided by the sender, but it
might break some programs.
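
Why all-zero credentials are dangerous can be sketched as follows.
This is a simplified model, not kernel or avahi code: struct toy_creds,
the force_creds parameter and the receiver's "uid 0 means trusted"
policy are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the data carried in SCM_CREDENTIALS. */
struct toy_creds {
    int pid;
    unsigned int uid;
    unsigned int gid;
};

/*
 * Credentials that reach the receiver.
 *   sender:      the real identity of the sending process
 *   provided:    the sender attached credentials itself
 *   force_creds: the kernel fills them in when the sender did not
 */
struct toy_creds toy_delivered_creds(struct toy_creds sender,
                                     bool provided, bool force_creds)
{
    struct toy_creds zero = { 0, 0, 0 };

    if (provided || force_creds)
        return sender;
    return zero;    /* the regression: all-zero data is delivered */
}

/* Assumed receiver policy: only trust messages that claim uid 0. */
bool toy_receiver_trusts(struct toy_creds creds)
{
    return creds.uid == 0;
}

/* An unprivileged sender that attaches no credentials at all. */
bool toy_spoof_succeeds(bool force_creds)
{
    struct toy_creds unprivileged = { 4242, 1000, 1000 };

    return toy_receiver_trusts(
        toy_delivered_creds(unprivileged, false, force_creds));
}
```

Without forced credentials the zeroed data looks like it came from uid
0, so the spoof succeeds; with forced credentials the receiver sees
uid 1000 and rejects the message.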

With help from Florian Weimer & Petr Matousek

This issue is designated as CVE-2012-3520

Signed-off-by: Eric Dumazet 
Cc: Petr Matousek 
Cc: Florian Weimer 
Cc: Pablo Neira Ayuso 
Signed-off-by: David S. Miller

ipv4: fix ip header ident selection in __ip_make_skb()

2012-08-21T21:51:06+00:00

Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
memory in __ip_select_ident().

It turns out that __ip_make_skb() called ip_select_ident() before
properly initializing iph->daddr.

This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
cached route inetpeer.)

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131

Reported-by: Christian Casteyde 
Signed-off-by: Eric Dumazet 
Cc: Stephen Hemminger 
Signed-off-by: David S. Miller

ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()

2012-08-21T21:49:11+00:00

Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and
TCP."), inet_csk_route_child_sock() is called instead of
inet_csk_route_req().

However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
Thus, inside inet_csk_route_child_sock() opt is always NULL and the
SRR-options are not respected anymore.
Packets sent by the server won't have the correct destination-IP.

This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
inside inet_csk_route_child_sock().
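
A minimal sketch of the pointer hand-off, assuming a simplified layout:
the toy_* structures and the addresses are hypothetical, and only the
choice of route destination is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the option and socket structures. */
struct toy_opt {
    bool srr;               /* a source-route option is present */
    unsigned int faddr;     /* first hop taken from that option */
};

struct toy_req {
    struct toy_opt *opt;
    unsigned int rmt_addr;
};

struct toy_child {
    struct toy_opt *inet_opt;
};

/* With SRR the route must target the first hop, not the remote address. */
unsigned int toy_route_daddr(const struct toy_opt *opt, unsigned int rmt_addr)
{
    return (opt && opt->srr) ? opt->faddr : rmt_addr;
}

unsigned int toy_child_daddr(bool use_fix)
{
    struct toy_opt opt = { .srr = true, .faddr = 0x0a000001u };
    struct toy_req req = { .opt = &opt, .rmt_addr = 0xc0a80001u };
    struct toy_child child;

    /* Child creation takes over the options and clears the request. */
    child.inet_opt = req.opt;
    req.opt = NULL;

    /* Old code looked at the request, the fix looks at the child. */
    return toy_route_daddr(use_fix ? child.inet_opt : req.opt,
                           req.rmt_addr);
}
```

Here toy_child_daddr(false) yields the plain remote address because the
request no longer carries the options, while toy_child_daddr(true)
yields the first hop of the source route.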

Reported-by: Luca Boccassi 
Signed-off-by: Christoph Paasch 
Signed-off-by: David S. Miller

tcp: fix possible socket refcount problem

2012-08-21T21:42:23+00:00

Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
introduced a bug leading to the following trace:

[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [] process_one_work+0x1de/0x47f
[ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [] process_one_work+0x1de/0x47f
[ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [] tcp_sendmsg+0x29/0xcc6
[ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281]    [] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281]  [] ? __sk_free+0xfd/0x114
[ 2866.132281]  [] kmem_cache_free+0x6b/0x13a
[ 2866.132281]  [] __sk_free+0xfd/0x114
[ 2866.132281]  [] sk_free+0x1c/0x1e
[ 2866.132281]  [] tcp_write_timer+0x51/0x56
[ 2866.132281]  [] run_timer_softirq+0x218/0x35f
[ 2866.132281]  [] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281]  [] ? rb_commit+0x58/0x85
[ 2866.132281]  [] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281]  [] __do_softirq+0xcb/0x1f9
[ 2866.132281]  [] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281]  [] call_softirq+0x1c/0x30
[ 2866.132281]  [] do_softirq+0x4a/0xa6
[ 2866.132281]  [] irq_exit+0x51/0xad
[ 2866.132281]  [] do_IRQ+0x9d/0xb4
[ 2866.132281]  [] common_interrupt+0x6f/0x6f
[ 2866.132281]    [] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281]  [] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281]  [] mod_timer+0x178/0x1a9
[ 2866.132281]  [] sk_reset_timer+0x19/0x26
[ 2866.132281]  [] tcp_rearm_rto+0x99/0xa4
[ 2866.132281]  [] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281]  [] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281]  [] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281]  [] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281]  [] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281]  [] inet_sendmsg+0xaa/0xd5
[ 2866.132281]  [] ? inet_autobind+0x5f/0x5f
[ 2866.132281]  [] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [] sock_sendmsg+0xa3/0xc4
[ 2866.132281]  [] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281]  [] ? native_sched_clock+0x29/0x6f
[ 2866.132281]  [] ? sched_clock+0x9/0xd
[ 2866.132281]  [] ? trace_clock_local+0x9/0xb
[ 2866.132281]  [] kernel_sendmsg+0x37/0x43
[ 2866.132281]  [] xs_send_kvec+0x77/0x80
[ 2866.132281]  [] xs_sendpages+0x6f/0x1a0
[ 2866.132281]  [] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281]  [] xs_tcp_send_request+0x55/0xf1
[ 2866.132281]  [] xprt_transmit+0x89/0x1db
[ 2866.132281]  [] ? call_connect+0x3c/0x3c
[ 2866.132281]  [] call_transmit+0x1c5/0x20e
[ 2866.132281]  [] __rpc_execute+0x6f/0x225
[ 2866.132281]  [] ? call_connect+0x3c/0x3c
[ 2866.132281]  [] rpc_async_schedule+0x28/0x34
[ 2866.132281]  [] process_one_work+0x24d/0x47f
[ 2866.132281]  [] ? process_one_work+0x1de/0x47f
[ 2866.132281]  [] ? __rpc_execute+0x225/0x225
[ 2866.132281]  [] worker_thread+0x236/0x317
[ 2866.132281]  [] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281]  [] kthread+0x9a/0xa2
[ 2866.132281]  [] kernel_thread_helper+0x4/0x10
[ 2866.132281]  [] ? retint_restore_args+0x13/0x13
[ 2866.132281]  [] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281]  [] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]

The bug arises because the timer set in sk_reset_timer() can run
before we actually do the sock_hold().  The socket refcount then
reaches zero and we free the socket too soon.

The timer handler must not reduce the socket refcount if the socket is
owned by the user; otherwise we would need to change the
sk_reset_timer() implementation.

We should take a reference on the socket when the TCP_WRITE_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit is set in tsq_flags.

Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.

For consistency, use the same socket refcount change for
TCP_MTU_REDUCED_DEFERRED, even though it is not fired from a timer.
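
The refcount accounting can be sketched with a toy model.  This is an
assumption-laden illustration, not the kernel code: the toy_* names are
hypothetical, only the write timer is modelled, and the race is reduced
to a fixed sequence in which the timer fires before the arming side has
taken its reference.

```c
#include <stdbool.h>

#define TOY_WRITE_TIMER_DEFERRED 0x1u

struct toy_sock {
    int refcnt;
    int min_refcnt;         /* lowest value seen; 0 means freed too soon */
    unsigned int tsq_flags;
    bool owned_by_user;
};

void toy_hold(struct toy_sock *sk)
{
    sk->refcnt++;
}

void toy_put(struct toy_sock *sk)
{
    sk->refcnt--;
    if (sk->refcnt < sk->min_refcnt)
        sk->min_refcnt = sk->refcnt;
}

/* Timer handler: the socket is busy, so only mark the work as deferred. */
void toy_write_timer(struct toy_sock *sk, bool use_fix)
{
    if (sk->owned_by_user && !(sk->tsq_flags & TOY_WRITE_TIMER_DEFERRED)) {
        sk->tsq_flags |= TOY_WRITE_TIMER_DEFERRED;
        if (use_fix)
            toy_hold(sk);   /* reference owned by the deferred work */
    }
    toy_put(sk);            /* drop the reference the timer should own */
}

/* Runs when the owner releases the socket and handles deferred work. */
void toy_release_cb(struct toy_sock *sk, bool use_fix)
{
    if (sk->tsq_flags & TOY_WRITE_TIMER_DEFERRED) {
        sk->tsq_flags &= ~TOY_WRITE_TIMER_DEFERRED;
        if (use_fix)
            toy_put(sk);    /* release the deferred work's reference */
    }
}

int toy_min_refcnt(bool use_fix)
{
    struct toy_sock sk = {
        .refcnt = 1, .min_refcnt = 1, .owned_by_user = true,
    };

    toy_write_timer(&sk, use_fix);  /* fires before the hold below */
    toy_hold(&sk);                  /* the late hold by the arming side */
    sk.owned_by_user = false;
    toy_release_cb(&sk, use_fix);
    return sk.min_refcnt;
}
```

In this sequence the old accounting lets the count touch zero, while
the extra hold taken when the deferred bit is set keeps it at one.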

Reported-by: Fengguang Wu 
Tested-by: Fengguang Wu 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

svcrpc: fix svc_xprt_enqueue/svc_recv busy-looping

2012-08-20T22:39:19+00:00

The rpc server tries to ensure that there will be room to send a reply
before it receives a request.

It does this by tracking, in xpt_reserved, an upper bound on the total
size of the replies that it has already committed to for the socket.

Currently it adds the estimate for a new reply *before* it checks
whether there is space available.  If it finds that there is not
enough space, it then subtracts the estimate back out.

This may lead the subsequent svc_xprt_enqueue to decide that there is
space after all.

The result is a svc_recv() that will repeatedly return -EAGAIN, causing
server threads to loop without doing any actual work.
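
The effect of the ordering can be sketched with a toy model.  The
toy_* names, the numbers and the shape of the space check are
assumptions for illustration; the real checks live in the transport's
has_wspace method and in svc_recv().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-transport accounting. */
struct toy_xprt {
    long wspace;      /* send buffer space currently available */
    long reserved;    /* upper bound on replies already committed to */
};

/* Space check shared by the enqueue side and the receive side. */
bool toy_has_wspace(const struct toy_xprt *x, long estimate)
{
    return x->wspace >= x->reserved + estimate;
}

/* Old order: reserve first, then check, then undo on failure. */
bool toy_recv_old(struct toy_xprt *x, long estimate)
{
    x->reserved += estimate;
    if (!toy_has_wspace(x, estimate)) {
        x->reserved -= estimate;
        return false;               /* behaves like -EAGAIN */
    }
    return true;
}

/* New order: check first, reserve only when the check passes. */
bool toy_recv_new(struct toy_xprt *x, long estimate)
{
    if (!toy_has_wspace(x, estimate))
        return false;
    x->reserved += estimate;
    return true;
}

/* Counts wakeups where enqueue saw space but the receive then failed. */
int toy_wasted_wakeups(bool use_fix, int rounds)
{
    struct toy_xprt x = { .wspace = 150, .reserved = 0 };
    const long estimate = 100;
    int wasted = 0;

    for (int i = 0; i < rounds; i++) {
        if (!toy_has_wspace(&x, estimate))
            continue;               /* enqueue would not wake a thread */
        bool ok = use_fix ? toy_recv_new(&x, estimate)
                          : toy_recv_old(&x, estimate);
        if (!ok)
            wasted++;
    }
    return wasted;
}
```

With room for one reply but not two, the old order wastes every wakeup
because the receive side effectively counts the estimate twice; the new
order accepts one request and then stops waking threads.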

Cc: stable@vger.kernel.org
Reported-by: Michael Tokarev 
Tested-by: Michael Tokarev 
Signed-off-by: J. Bruce Fields

svcrpc: sends on closed socket should stop immediately

2012-08-20T22:38:59+00:00

svc_tcp_sendto sets XPT_CLOSE if we fail to transmit the entire reply.
However, the XPT_CLOSE won't be acted on immediately.  Meanwhile other
threads could send further replies before the socket is really shut
down.  This can manifest as data corruption: for example, if a truncated
read reply is followed by another rpc reply, that second reply will look
to the client like further read data.

Symptoms were data corruption preceded by svc_tcp_sendto logging
something like

	kernel: rpc-srv/tcp: nfsd: sent only 963696 when sending 1048708 bytes - shutting down socket
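
How a later reply ends up behind a truncated one can be sketched as
follows.  This is a simplified model with hypothetical toy_* names; the
byte counts are taken from the log line above, and the "room"
parameter stands in for whatever the socket accepts on a given call.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a TCP transport shared by server threads. */
struct toy_stream {
    bool close_pending;     /* set after a short send, acted on later */
    long bytes_on_wire;     /* everything the client will ever read */
};

/* Returns the number of bytes sent, or -1 if the send was refused. */
long toy_send(struct toy_stream *s, long len, long room, bool use_fix)
{
    if (use_fix && s->close_pending)
        return -1;          /* stop immediately on a closing socket */

    long sent = len < room ? len : room;

    s->bytes_on_wire += sent;
    if (sent < len)
        s->close_pending = true;    /* reply truncated: mark for close */
    return sent;
}

/* A truncated read reply followed by an unrelated 100-byte reply. */
long toy_bytes_seen_by_client(bool use_fix)
{
    struct toy_stream s = { .close_pending = false, .bytes_on_wire = 0 };

    toy_send(&s, 1048708, 963696, use_fix);
    toy_send(&s, 100, 1000, use_fix);
    return s.bytes_on_wire;
}
```

Without the check the client receives 963796 bytes and reads the second
reply as the tail of the first; with the check it receives only the
963696 bytes of the truncated reply before the socket is shut down.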

Cc: stable@vger.kernel.org
Reported-by: Malahal Naineni 
Tested-by: Malahal Naineni 
Signed-off-by: J. Bruce Fields

svcrpc: fix BUG() in svc_tcp_clear_pages

2012-08-20T22:38:44+00:00

Examination of svc_tcp_clear_pages shows that it assumes sk_tcplen is
consistent with sk_pages[] (in particular, sk_pages[n] can't be NULL if
sk_tcplen would lead us to expect n pages of data).

svc_tcp_restore_pages zeroes out sk_pages[] while leaving sk_tcplen.
This is OK, since both functions are serialized by XPT_BUSY.  However,
that means the inconsistency must be repaired before dropping XPT_BUSY.

Therefore we should ensure that svc_tcp_save_pages repairs the
problem before exiting svc_tcp_recv_record on error.
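
The invariant can be sketched with a toy model.  The toy_* structures,
the page size and the helper names are assumptions for illustration;
only the relationship between the byte count and the page array is
modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_MAX_PAGES 4

/* Hypothetical stand-ins for the socket and request page arrays. */
struct toy_sock {
    void *pages[TOY_MAX_PAGES];
    unsigned int tcplen;    /* bytes of a partial record held in pages[] */
};

struct toy_rqst {
    void *pages[TOY_MAX_PAGES];
};

unsigned int toy_npages(unsigned int len)
{
    return (len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Hand the partial record to the request; pages[] is zeroed, tcplen kept. */
void toy_restore_pages(struct toy_sock *sk, struct toy_rqst *rq)
{
    for (unsigned int i = 0; i < toy_npages(sk->tcplen); i++) {
        rq->pages[i] = sk->pages[i];
        sk->pages[i] = NULL;
    }
}

/* Hand the partial record back so pages[] matches tcplen again. */
void toy_save_pages(struct toy_sock *sk, struct toy_rqst *rq)
{
    for (unsigned int i = 0; i < toy_npages(sk->tcplen); i++) {
        sk->pages[i] = rq->pages[i];
        rq->pages[i] = NULL;
    }
}

/* The assumption the clear routine relies on. */
bool toy_is_consistent(const struct toy_sock *sk)
{
    for (unsigned int i = 0; i < toy_npages(sk->tcplen); i++)
        if (sk->pages[i] == NULL)
            return false;
    return true;
}

bool toy_consistent_after_error(bool save_on_error)
{
    static char page0[1], page1[1];
    struct toy_sock sk = { .pages = { page0, page1 }, .tcplen = 5000 };
    struct toy_rqst rq = { { NULL } };

    toy_restore_pages(&sk, &rq);
    /* ... the receive fails at this point ... */
    if (save_on_error)
        toy_save_pages(&sk, &rq);
    return toy_is_consistent(&sk);
}
```

Leaving the error path without saving the pages back produces exactly
the state that the clear routine cannot handle: a nonzero length with
NULL page slots.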

Symptoms were a BUG() in svc_tcp_clear_pages.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields