<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/net/ipv4/ip_input.c, branch v4.19</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>net: ipv4: fix listify ip_rcv_finish in case of forwarding</title>
<updated>2018-07-12T23:40:19+00:00</updated>
<author>
<name>Jesper Dangaard Brouer</name>
<email>brouer@redhat.com</email>
</author>
<published>2018-07-11T15:01:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0761680d521559813648fb430ddeb479c97ab060'/>
<id>0761680d521559813648fb430ddeb479c97ab060</id>
<content type='text'>
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
dst_input(skb) in a loop.

The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input().  Further more we need to
clear skb-&gt;next as other parts of the network stack use another kind
of SKB lists for xmit_more (see dev_hard_start_xmit).

A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().

This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB-list through to ip_forward.

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer &lt;brouer@redhat.com&gt;
Acked-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In commit 5fa12739a53d ("net: ipv4: listify ip_rcv_finish") calling
dst_input(skb) was split-out.  The ip_sublist_rcv_finish() just calls
dst_input(skb) in a loop.

The problem is that ip_sublist_rcv_finish() forgot to remove the SKB
from the list before invoking dst_input().  Further more we need to
clear skb-&gt;next as other parts of the network stack use another kind
of SKB lists for xmit_more (see dev_hard_start_xmit).

A crash occurs if e.g. dst_input() invoke ip_forward(), which calls
dst_output()/ip_output() that eventually calls __dev_queue_xmit() +
sch_direct_xmit(), and a crash occurs in validate_xmit_skb_list().

This patch only fixes the crash, but there is a huge potential for
a performance boost if we can pass an SKB-list through to ip_forward.

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Jesper Dangaard Brouer &lt;brouer@redhat.com&gt;
Acked-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: ipv4: fix list processing on L3 slave devices</title>
<updated>2018-07-06T02:19:07+00:00</updated>
<author>
<name>Edward Cree</name>
<email>ecree@solarflare.com</email>
</author>
<published>2018-07-05T14:47:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=efe6aaca67a0229a195f493d142102c864b41dde'/>
<id>efe6aaca67a0229a195f493d142102c864b41dde</id>
<content type='text'>
If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but
 we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant
 that ip_list_rcv_finish() would keep it on the list.  Instead let's
 move the l3mdev_ip_rcv() call into the caller, so that our response to
 a steal can be different in the single packet path (return
 NET_RX_SUCCESS) and the list path (forget this packet and continue).

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If we have an L3 master device, l3mdev_ip_rcv() will steal the skb, but
 we were returning NET_RX_SUCCESS from ip_rcv_finish_core() which meant
 that ip_list_rcv_finish() would keep it on the list.  Instead let's
 move the l3mdev_ip_rcv() call into the caller, so that our response to
 a steal can be different in the single packet path (return
 NET_RX_SUCCESS) and the list path (forget this packet and continue).

Fixes: 5fa12739a53d ("net: ipv4: listify ip_rcv_finish")
Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()</title>
<updated>2018-07-05T02:25:41+00:00</updated>
<author>
<name>Edward Cree</name>
<email>ecree@solarflare.com</email>
</author>
<published>2018-07-04T18:23:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a4ca8b7df73c6d78b8b5aa8246a7d794b25c25ce'/>
<id>a4ca8b7df73c6d78b8b5aa8246a7d794b25c25ce</id>
<content type='text'>
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
 the skb, we can't use the list_cut_before() method; we can't even do a
 list_del(&amp;skb-&gt;list) in the drop case, because skb might have already been
 freed and reused.
So instead, take each skb off the source list before processing, and add it
 to the sublist afterwards if it wasn't freed or stolen.

Fixes: 5fa12739a53d net: ipv4: listify ip_rcv_finish
Fixes: 17266ee93984 net: ipv4: listified version of ip_rcv
Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since callees (ip_rcv_core() and ip_rcv_finish_core()) might free or steal
 the skb, we can't use the list_cut_before() method; we can't even do a
 list_del(&amp;skb-&gt;list) in the drop case, because skb might have already been
 freed and reused.
So instead, take each skb off the source list before processing, and add it
 to the sublist afterwards if it wasn't freed or stolen.

Fixes: 5fa12739a53d net: ipv4: listify ip_rcv_finish
Fixes: 17266ee93984 net: ipv4: listified version of ip_rcv
Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: ipv4: listify ip_rcv_finish</title>
<updated>2018-07-04T05:06:20+00:00</updated>
<author>
<name>Edward Cree</name>
<email>ecree@solarflare.com</email>
</author>
<published>2018-07-02T15:14:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5fa12739a53d0780265ed9d44d9ec9ba5f9ad00a'/>
<id>5fa12739a53d0780265ed9d44d9ec9ba5f9ad00a</id>
<content type='text'>
ip_rcv_finish_core(), if it does not drop, sets skb-&gt;dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph-&gt;protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).

Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
ip_rcv_finish_core(), if it does not drop, sets skb-&gt;dst by either early
 demux or route lookup.  The last step, calling dst_input(skb), is left to
 the caller; in the listified case, we split to form sublists with a common
 dst, but then ip_sublist_rcv_finish() just calls dst_input(skb) in a loop.
The next step in listification would thus be to add a list_input() method
 to struct dst_entry.

Early demux is an indirect call based on iph-&gt;protocol; this is another
 opportunity for listification which is not taken here (it would require
 slicing up ip_rcv_finish_core() to allow splitting on protocol changes).

Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: ipv4: listified version of ip_rcv</title>
<updated>2018-07-04T05:06:20+00:00</updated>
<author>
<name>Edward Cree</name>
<email>ecree@solarflare.com</email>
</author>
<published>2018-07-02T15:14:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=17266ee939849cb095ed7dd9edbec4162172226b'/>
<id>17266ee939849cb095ed7dd9edbec4162172226b</id>
<content type='text'>
Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself) we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of packets it's synchronously accepted (which normally NF_HOOK
 would automatically call okfn() on, but we want to be able to potentially
 pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Also involved adding a way to run a netfilter hook over a list of packets.
 Rather than attempting to make netfilter know about lists (which would be
 a major project in itself) we just let it call the regular okfn (in this
 case ip_rcv_finish()) for any packets it steals, and have it give us back
 a list of packets it's synchronously accepted (which normally NF_HOOK
 would automatically call okfn() on, but we want to be able to potentially
 pass the list to a listified version of okfn().)
The netfilter hooks themselves are indirect calls that still happen per-
 packet (see nf_hook_entry_hookfn()), but again, changing that can be left
 for future work.

There is potential for out-of-order receives if the netfilter hook ends up
 synchronously stealing packets, as they will be processed before any
 accepts earlier in the list.  However, it was already possible for an
 asynchronous accept to cause out-of-order receives, so presumably this is
 considered OK.

Signed-off-by: Edward Cree &lt;ecree@solarflare.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: Make ip_ra_chain per struct net</title>
<updated>2018-03-22T19:12:56+00:00</updated>
<author>
<name>Kirill Tkhai</name>
<email>ktkhai@virtuozzo.com</email>
</author>
<published>2018-03-22T09:45:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5796ef75ec7b6019eac88f66751d663d537a5cd3'/>
<id>5796ef75ec7b6019eac88f66751d663d537a5cd3</id>
<content type='text'>
This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>IPv4: early demux can return an error code</title>
<updated>2017-10-01T02:55:47+00:00</updated>
<author>
<name>Paolo Abeni</name>
<email>pabeni@redhat.com</email>
</author>
<published>2017-09-28T13:51:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7487449c86c65202b3b725c4524cb48dd65e4e6f'/>
<id>7487449c86c65202b3b725c4524cb48dd65e4e6f</id>
<content type='text'>
Currently no error is emitted, but this infrastructure will
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently no error is emitted, but this infrastructure will
used by the next patch to allow source address validation
for mcast sockets.
Since early demux can do a route lookup and an ipv4 route
lookup can return an error code this is consistent with the
current ipv4 route infrastructure.

Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: Add sysctl to toggle early demux for tcp and udp</title>
<updated>2017-03-24T20:17:07+00:00</updated>
<author>
<name>subashab@codeaurora.org</name>
<email>subashab@codeaurora.org</email>
</author>
<published>2017-03-23T19:34:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dddb64bcb34615bf48a2c9cb9881eb76795cc5c5'/>
<id>dddb64bcb34615bf48a2c9cb9881eb76795cc5c5</id>
<content type='text'>
Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -&gt; 788Mbps unconnected single stream UDPv4
633 -&gt; 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1-&gt;v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2-&gt;v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3-&gt;v4: Store and use read once result instead of querying pointer
again incorrectly.

v4-&gt;v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan &lt;subashab@codeaurora.org&gt;
Suggested-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Cc: Tom Herbert &lt;tom@herbertland.com&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Certain system process significant unconnected UDP workload.
It would be preferrable to disable UDP early demux for those systems
and enable it for TCP only.

By disabling UDP demux, we see these slight gains on an ARM64 system-
782 -&gt; 788Mbps unconnected single stream UDPv4
633 -&gt; 654Mbps unconnected UDPv4 different sources

The performance impact can change based on CPU architecure and cache
sizes. There will not much difference seen if entire UDP hash table
is in cache.

Both sysctls are enabled by default to preserve existing behavior.

v1-&gt;v2: Change function pointer instead of adding conditional as
suggested by Stephen.

v2-&gt;v3: Read once in callers to avoid issues due to compiler
optimizations. Also update commit message with the tests.

v3-&gt;v4: Store and use read once result instead of querying pointer
again incorrectly.

v4-&gt;v5: Refactor to avoid errors due to compilation with IPV6={m,n}

Signed-off-by: Subash Abhinov Kasiviswanathan &lt;subashab@codeaurora.org&gt;
Suggested-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Stephen Hemminger &lt;stephen@networkplumber.org&gt;
Cc: Tom Herbert &lt;tom@herbertland.com&gt;
Cc: David Miller &lt;davem@davemloft.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: VRF: Pass original iif to ip_route_input()</title>
<updated>2016-09-16T08:24:07+00:00</updated>
<author>
<name>Mark Tomlinson</name>
<email>mark.tomlinson@alliedtelesis.co.nz</email>
</author>
<published>2016-09-14T23:40:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d6f64d725bac20df66b2eacd847fc41d7a1905e0'/>
<id>d6f64d725bac20df66b2eacd847fc41d7a1905e0</id>
<content type='text'>
The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb-&gt;dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson &lt;mark.tomlinson@alliedtelesis.co.nz&gt;
Acked-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The function ip_rcv_finish() calls l3mdev_ip_rcv(). On any VRF except
the global VRF, this replaces skb-&gt;dev with the VRF master interface.
When calling ip_route_input_noref() from here, the checks for forwarding
look at this master device instead of the initial ingress interface.
This will allow packets to be routed which normally would be dropped.
For example, an interface that is not assigned an IP address should
drop packets, but because the checking is against the master device, the
packet will be forwarded.

The fix here is to still call l3mdev_ip_rcv(), but remember the initial
net_device. This is passed to the other functions within ip_rcv_finish,
so they still see the original interface.

Signed-off-by: Mark Tomlinson &lt;mark.tomlinson@alliedtelesis.co.nz&gt;
Acked-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>net: original ingress device index in PKTINFO</title>
<updated>2016-05-11T23:31:40+00:00</updated>
<author>
<name>David Ahern</name>
<email>dsa@cumulusnetworks.com</email>
</author>
<published>2016-05-10T18:19:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0b922b7a829c06e3b0790c58cd9ca026de86096e'/>
<id>0b922b7a829c06e3b0790c58cd9ca026de86096e</id>
<content type='text'>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.

The previous patch moving the skb-&gt;dev change to L3 means nothing else
is needed for IPv6; it just works.

Signed-off-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Applications such as OSPF and BFD need the original ingress device not
the VRF device; the latter can be derived from the former. To that end
add the skb_iif to inet_skb_parm and set it in ipv4 code after clearing
the skb control buffer similar to IPv6. From there the pktinfo can just
pull it from cb with the PKTINFO_SKB_CB cast.

The previous patch moving the skb-&gt;dev change to L3 means nothing else
is needed for IPv6; it just works.

Signed-off-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
</feed>
