linux-stable.git/net/tipc, branch linux-4.3.y

tipc: Fix kfree_skb() of uninitialised pointer

2016-01-23T04:55:40+00:00

Commit 7098356baca7 ("tipc: fix error handling of expanding buffer
headroom") added a "goto tx_error".  This is fine upstream, but
when backported to 4.3 it results in attempting to free the clone
before it has been allocated.  In this early error case, no
cleanup is needed.

Signed-off-by: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman

tipc: fix error handling of expanding buffer headroom

2015-12-15T05:41:03+00:00

[ Upstream commit 7098356baca723513e97ca0020df4e18bc353be3 ]

Coverity says:
*** CID 1338065:  Error handling issues  (CHECKED_RETURN)
/net/tipc/udp_media.c: 162 in tipc_udp_send_msg()
156             struct udp_media_addr *dst = (struct udp_media_addr *)&dest->value;
157             struct udp_media_addr *src = (struct udp_media_addr *)&b->addr.value;
158             struct sk_buff *clone;
159             struct rtable *rt;
160
161             if (skb_headroom(skb) < UDP_MIN_HEADROOM)
>>>     CID 1338065:  Error handling issues  (CHECKED_RETURN)
>>>     Calling "pskb_expand_head" without checking return value (as is done elsewhere 51 out of 56
+times).
162                     pskb_expand_head(skb, UDP_MIN_HEADROOM, 0, GFP_ATOMIC);
163
164             clone = skb_clone(skb, GFP_ATOMIC);
165             skb_set_inner_protocol(clone, htons(ETH_P_TIPC));
166             ub = rcu_dereference_rtnl(b->media_ptr);
167             if (!ub) {

When expanding buffer headroom over udp tunnel with pskb_expand_head(),
it's unfortunate that we don't check its return value. As a result, if
the function returns an error code due to the lack of memory, it may
cause unpredictable consequence as we unconditionally consider that
it's always successful.

Fixes: e53567948f82 ("tipc: conditionally expand buffer headroom over udp tunnel")
Reported-by: 
Cc: Stephen Hemminger 
Signed-off-by: Ying Xue 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tipc: linearize arriving NAME_DISTR and LINK_PROTO buffers

2015-12-09T19:34:06+00:00

[ Upstream commit 5cbb28a4bf65c7e4daa6c25b651fed8eb888c620 ]

Testing of the new UDP bearer has revealed that reception of
NAME_DISTRIBUTOR, LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE
message buffers is not prepared for the case that those may be
non-linear.

We now linearize all such buffers before they are delivered up to the
generic reception layer.

In order for the commit to apply cleanly to 'net' and 'stable', we do
the change in the function tipc_udp_recv() for now. Later, we will post
a commit to 'net-next' moving the linearization to generic code, in
tipc_named_rcv() and tipc_link_proto_rcv().

Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tipc: conditionally expand buffer headroom over udp tunnel

2015-10-22T02:13:48+00:00

In commit d999297c3dbbe ("tipc: reduce locking scope during packet reception")
we altered the packet retransmission function. Since then, when
restransmitting packets, we create a clone of the original buffer
using __pskb_copy(skb, MIN_H_SIZE), where MIN_H_SIZE is the size of
the area we want to have copied, but also the smallest possible TIPC
packet size. The value of MIN_H_SIZE is 24.

Unfortunately, __pskb_copy() also has the effect that the headroom
of the cloned buffer takes the size MIN_H_SIZE. This is too small
for carrying the packet over the UDP tunnel bearer, which requires
a minimum headroom of 28 bytes. A change to just use pskb_copy()
lets the clone inherit the original headroom of 80 bytes, but also
assumes that the copied data area is of at least that size, something
that is not always the case. So that is not a viable solution.

We now fix this by adding a check for sufficient headroom in the
transmit function of udp_media.c, and expanding it when necessary.

Fixes: commit d999297c3dbbe ("tipc: reduce locking scope during packet reception")
Signed-off-by: Jon Maloy 
Signed-off-by: David S. Miller

tipc: allow non-linear first fragment buffer

2015-10-22T02:08:24+00:00

The current code for message reassembly is erroneously assuming that
the the first arriving fragment buffer always is linear, and then goes
ahead resetting the fragment list of that buffer in anticipation of
more arriving fragments.

However, if the buffer already happens to be non-linear, we will
inadvertently drop the already attached fragment list, and later
on trig a BUG() in __pskb_pull_tail().

We see this happen when running fragmented TIPC multicast across UDP,
something made possible since
commit d0f91938bede ("tipc: add ip/udp media type")

We fix this by not resetting the fragment list when the buffer is non-
linear, and by initiatlizing our private fragment list tail pointer to
the tail of the existing fragment list.

Fixes: commit d0f91938bede ("tipc: add ip/udp media type")
Signed-off-by: Jon Maloy 
Signed-off-by: David S. Miller

tipc: extend broadcast link window size

2015-10-22T02:02:17+00:00

The default fix broadcast window size is currently set to 20 packets.
This is a very low value, set at a time when we were still testing on
10 Mb/s hubs, and a change to it is long overdue.

Commit 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
revealed a problem with this low value. For messages of importance LOW,
the backlog queue limit will be  calculated to 30 packets, while a
single, maximum sized message of 66000 bytes, carried across a 1500 MTU
network consists of 46 packets.

This leads to the following scenario (among others leading to the same
situation):

1: Msg 1 of 46 packets is sent. 20 packets go to the transmit queue, 26
   packets to the backlog queue.
2: Msg 2 of 46 packets is attempted sent, but rejected because there is
   no more space in the backlog queue at this level. The sender is added
   to the wakeup queue with a "pending packets chain size" number of 46.
3: Some packets in the transmit queue are acked and released. We try to
   wake up the sender, but the pending size of 46 is bigger than the LOW
   wakeup limit of 30, so this doesn't happen.
5: Subsequent acks releases all the remaining buffers. Each time we test
   for the wakeup criteria and find that 46 still is larger than 30,
   even after both the transmit and the backlog queues are empty.
6: The sender is never woken up and given a chance to send its message.
   He is stuck.

We could now loosen the wakeup criteria (used by link_prepare_wakeup())
to become equal to the send criteria (used by tipc_link_xmit()), i.e.,
by ignoring the "pending packets chain size" value altogether, or we can
just increase the queue limits so that the criteria can be satisfied
anyway. There are good reasons (potentially multiple waiting senders) to
not opt for the former solution, so we choose the latter one.

This commit fixes the problem by giving the broadcast link window a
default value of 50 packets. We also introduce a new minimum link
window size BCLINK_MIN_WIN of 32, which is enough to always avoid the
described situation. Finally, in order to not break any existing users
which may set the window explicitly, we enforce that the window is set
to the new minimum value in case the user is trying to set it to
anything lower.

Fixes: 7845989cb4b3da1db ("net: tipc: fix stall during bclink wakeup procedure")
Signed-off-by: Jon Maloy 
Reviewed-by: Ying Xue 
Signed-off-by: David S. Miller

tipc: move fragment importance field to new header position

2015-10-15T02:10:08+00:00

In commit e3eea1eb47a ("tipc: clean up handling of message priorities")
we introduced a field in the packet header for keeping track of the
priority of fragments, since this value is not present in the specified
protocol header. Since the value so far only is used at the transmitting
end of the link, we have not yet officially defined it as part of the
protocol.

Unfortunately, the field we use for keeping this value, bits 13-15 in
in word 5, has turned out to be a poor choice; it is already used by the
broadcast protocol for carrying the 'network id' field of the sending
node. Since packet fragments also need to be transported across the
broadcast protocol, the risk of conflict is obvious, and we see this
happen when we use network identities larger than 2^13-1. This has
escaped our testing because we have so far only been using small network
id values.

We now move this field to bits 0-2 in word 9, a field that is guaranteed
to be unused by all involved protocols.

Fixes: e3eea1eb47a ("tipc: clean up handling of message priorities")
Signed-off-by: Jon Maloy 
Acked-by: Ying Xue 
Signed-off-by: David S. Miller

tipc: eliminate risk of stalled link synchronization

2015-10-14T13:06:40+00:00

In commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")
we introduced a new mechanism for performing link failover and
synchronization. We have now detected a bug in this mechanism.

During link synchronization we use the arrival of any packet on
the tunnel link to trig a check for whether it has reached the
synchronization point or not. This has turned out to be too
permissive, since it may cause an arriving non-last SYNCH packet to
end the synch state, just to see the next SYNCH packet initiate a
new synch state with a new, higher synch point. This is not fatal,
but should be avoided, because it may significantly extend the
synchronization period, while at the same time we are not allowed
to send NACKs if packets are lost. In the worst case, a low-traffic
user may see its traffic stall until a LINK_PROTOCOL state message
trigs the link to leave synchronization state.

At the same time, LINK_PROTOCOL packets which happen to have a (non-
valid) sequence number lower than the tunnel link's rcv_nxt value will
be consistently dropped, and will never be able to resolve the situation
described above.

We fix this by exempting LINK_PROTOCOL packets from the sequence number
check, as they should be. We also reduce (but don't completely
eliminate) the risk of entering multiple synchronization states by only
allowing the (logically) first SYNCH packet to initiate a synchronization
state. This works independently of actual packet arrival order.

Fixes: commit 6e498158a827 ("tipc: move link synch and failover to link aggregation level")

Signed-off-by: Jon Maloy
Acked-by: Ying Xue
Signed-off-by: David S. Miller

tipc: reinitialize pointer after skb linearize

2015-09-21T05:31:20+00:00

The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne 
Reported-by: Tamás Végh 
Acked-by: Ying Xue 
Signed-off-by: David S. Miller

net: tipc: fix stall during bclink wakeup procedure

2015-09-09T05:50:26+00:00

If an attempt to wake up users of broadcast link is made when there is
no enough place in send queue than it may hang up inside the
tipc_sk_rcv() function since the loop breaks only after the wake up
queue becomes empty. This can lead to complete CPU stall with the
following message generated by RCU:

INFO: rcu_sched self-detected stall on CPU { 0}  (t=2101 jiffies
					g=54225 c=54224 q=11465)
Task dump for CPU 0:
tpch            R  running task        0 39949  39948 0x0000000a
 ffffffff818536c0 ffff88181fa037a0 ffffffff8106a4be 0000000000000000
 ffffffff818536c0 ffff88181fa037c0 ffffffff8106d8a8 ffff88181fa03800
 0000000000000001 ffff88181fa037f0 ffffffff81094a50 ffff88181fa15680
Call Trace:
   [] sched_show_task+0xae/0x120
 [] dump_cpu_task+0x38/0x40
 [] rcu_dump_cpu_stacks+0x90/0xd0
 [] rcu_check_callbacks+0x3eb/0x6e0
 [] ? account_system_time+0x7f/0x170
 [] update_process_times+0x34/0x60
 [] tick_sched_handle.isra.18+0x31/0x40
 [] tick_sched_timer+0x3c/0x70
 [] __run_hrtimer.isra.34+0x3d/0xc0
 [] hrtimer_interrupt+0xc5/0x1e0
 [] ? native_smp_send_reschedule+0x42/0x60
 [] local_apic_timer_interrupt+0x34/0x60
 [] smp_apic_timer_interrupt+0x3c/0x60
 [] apic_timer_interrupt+0x6b/0x70
 [] ? _raw_spin_unlock_irqrestore+0x9/0x10
 [] __wake_up_sync_key+0x4f/0x60
 [] tipc_write_space+0x31/0x40 [tipc]
 [] filter_rcv+0x31f/0x520 [tipc]
 [] ? tipc_sk_lookup+0xc9/0x110 [tipc]
 [] ? _raw_spin_lock_bh+0x19/0x30
 [] tipc_sk_rcv+0x2dc/0x3e0 [tipc]
 [] tipc_bclink_wakeup_users+0x2f/0x40 [tipc]
 [] tipc_node_unlock+0x186/0x190 [tipc]
 [] ? kfree_skb+0x2c/0x40
 [] tipc_rcv+0x2ac/0x8c0 [tipc]
 [] tipc_l2_rcv_msg+0x38/0x50 [tipc]
 [] __netif_receive_skb_core+0x5a3/0x950
 [] __netif_receive_skb+0x13/0x60
 [] netif_receive_skb_internal+0x1e/0x90
 [] napi_gro_receive+0x78/0xa0
 [] tg3_poll_work+0xc54/0xf40 [tg3]
 [] ? consume_skb+0x2c/0x40
 [] tg3_poll_msix+0x41/0x160 [tg3]
 [] net_rx_action+0xe2/0x290
 [] __do_softirq+0xda/0x1f0
 [] irq_exit+0x76/0xa0
 [] do_IRQ+0x55/0xf0
 [] common_interrupt+0x6b/0x6b
 

The issue occurs only when tipc_sk_rcv() is used to wake up postponed
senders:

	tipc_bclink_wakeup_users()
		// wakeupq - is a queue which consists of special
		// 		 messages with SOCK_WAKEUP type.
		tipc_sk_rcv(wakeupq)
			...
			while (skb_queue_len(inputq)) {
				filter_rcv(skb)
					// Here the type of message is checked
					// and if it is SOCK_WAKEUP then
					// it tries to wake up a sender.
					tipc_write_space(sk)
						wake_up_interruptible_sync_poll()
			}

After the sender thread is woke up it can gather control and perform
an attempt to send a message. But if there is no enough place in send
queue it will call link_schedule_user() function which puts a message
of type SOCK_WAKEUP to the wakeup queue and put the sender to sleep.
Thus the size of the queue actually is not changed and the while()
loop never exits.

The approach I proposed is to wake up only senders for which there is
enough place in send queue so the described issue can't occur.
Moreover the same approach is already used to wake up senders on
unicast links.

I have got into the issue on our product code but to reproduce the
issue I changed a benchmark test application (from
tipcutils/demos/benchmark) to perform the following scenario:
	1. Run 64 instances of test application (nodes). It can be done
	   on the one physical machine.
	2. Each application connects to all other using TIPC sockets in
	   RDM mode.
	3. When setup is done all nodes start simultaneously send
	   broadcast messages.
	4. Everything hangs up.

The issue is reproducible only when a congestion on broadcast link
occurs. For example, when there are only 8 nodes it works fine since
congestion doesn't occur. Send queue limit is 40 in my case (I use a
critical importance level) and when 64 nodes send a message at the
same moment a congestion occurs every time.

Signed-off-by: Dmitry S Kolmakov 
Reviewed-by: Jon Maloy 
Acked-by: Ying Xue 
Signed-off-by: David S. Miller