<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/include/net, branch v4.1-rc2</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>net/bonding: Make DRV macros private</title>
<updated>2015-04-27T02:59:53+00:00</updated>
<author>
<name>Matan Barak</name>
<email>matanb@mellanox.com</email>
</author>
<published>2015-04-26T12:55:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=73b5a6f2a7a1cb78ccdec3900afc8657e11bc6bf'/>
<id>73b5a6f2a7a1cb78ccdec3900afc8657e11bc6bf</id>
<content type='text'>
The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION

Fixing that by defining a private bonding_priv.h
header files which includes those defines.

Signed-off-by: Matan Barak &lt;matanb@mellanox.com&gt;
Signed-off-by: Or Gerlitz &lt;ogerlitz@mellanox.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The bonding modules currently defines four macros with
general names that pollute the global namespace:
DRV_VERSION
DRV_RELDATE
DRV_NAME
DRV_DESCRIPTION

Fixing that by defining a private bonding_priv.h
header files which includes those defines.

Signed-off-by: Matan Barak &lt;matanb@mellanox.com&gt;
Signed-off-by: Or Gerlitz &lt;ogerlitz@mellanox.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>inet: fix possible panic in reqsk_queue_unlink()</title>
<updated>2015-04-24T15:39:15+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-04-24T01:03:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b357a364c57c940ddb932224542494363df37378'/>
<id>b357a364c57c940ddb932224542494363df37378</id>
<content type='text'>
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
 0000000000000080
[ 3897.931025] IP: [&lt;ffffffffa9f27686&gt;] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() &amp; tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
 0000000000000080
[ 3897.931025] IP: [&lt;ffffffffa9f27686&gt;] reqsk_timer_handler+0x1a6/0x243

There is a race when reqsk_timer_handler() and tcp_check_req() call
inet_csk_reqsk_queue_unlink() on the same req at the same time.

Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
timer"), listener spinlock was held and race could not happen.

To solve this bug, we change reqsk_queue_unlink() to not assume req
must be found, and we return a status, to conditionally release a
refcount on the request sock.

This also means tcp_check_req() in non fastopen case might or not
consume req refcount, so tcp_v6_hnd_req() &amp; tcp_v4_hnd_req() have
to properly handle this.

(Same remark for dccp_check_req() and its callers)

inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
called 4 times in tcp and 3 times in dccp.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Reported-by: Yuchung Cheng &lt;ycheng@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net</title>
<updated>2015-04-17T20:31:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-04-17T20:31:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=388f997620cb57372c494a194e9698b28cc179b8'/>
<id>388f997620cb57372c494a194e9698b28cc179b8</id>
<content type='text'>
Pull networking fixes from David Miller:

 1) Fix verifier memory corruption and other bugs in BPF layer, from
    Alexei Starovoitov.

 2) Add a conservative fix for doing BPF properly in the BPF classifier
    of the packet scheduler on ingress.  Also from Alexei.

 3) The SKB scrubber should not clear out the packet MARK and security
    label, from Herbert Xu.

 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.

 5) Pause handling is not correct in the stmmac driver because it
    doesn't take into consideration the RX and TX fifo sizes.  From
    Vince Bridgers.

 6) Failure path missing unlock in FOU driver, from Wang Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  net: dsa: use DEVICE_ATTR_RW to declare temp1_max
  netns: remove BUG_ONs from net_generic()
  IB/ipoib: Fix ndo_get_iflink
  sfc: Fix memcpy() with const destination compiler warning.
  altera tse: Fix network-delays and -retransmissions after high throughput.
  net: remove unused 'dev' argument from netif_needs_gso()
  act_mirred: Fix bogus header when redirecting from VLAN
  inet_diag: fix access to tcp cc information
  tcp: tcp_get_info() should fetch socket fields once
  net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
  skbuff: Do not scrub skb mark within the same name space
  Revert "net: Reset secmark when scrubbing packet"
  bpf: fix two bugs in verification logic when accessing 'ctx' pointer
  bpf: fix bpf helpers to use skb-&gt;mac_header relative offsets
  stmmac: Configure Flow Control to work correctly based on rxfifo size
  stmmac: Enable unicast pause frame detect in GMAC Register 6
  stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
  stmmac: Add defines and documentation for enabling flow control
  stmmac: Add properties for transmit and receive fifo sizes
  stmmac: fix oops on rmmod after assigning ip addr
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull networking fixes from David Miller:

 1) Fix verifier memory corruption and other bugs in BPF layer, from
    Alexei Starovoitov.

 2) Add a conservative fix for doing BPF properly in the BPF classifier
    of the packet scheduler on ingress.  Also from Alexei.

 3) The SKB scrubber should not clear out the packet MARK and security
    label, from Herbert Xu.

 4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.

 5) Pause handling is not correct in the stmmac driver because it
    doesn't take into consideration the RX and TX fifo sizes.  From
    Vince Bridgers.

 6) Failure path missing unlock in FOU driver, from Wang Cong.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  net: dsa: use DEVICE_ATTR_RW to declare temp1_max
  netns: remove BUG_ONs from net_generic()
  IB/ipoib: Fix ndo_get_iflink
  sfc: Fix memcpy() with const destination compiler warning.
  altera tse: Fix network-delays and -retransmissions after high throughput.
  net: remove unused 'dev' argument from netif_needs_gso()
  act_mirred: Fix bogus header when redirecting from VLAN
  inet_diag: fix access to tcp cc information
  tcp: tcp_get_info() should fetch socket fields once
  net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
  skbuff: Do not scrub skb mark within the same name space
  Revert "net: Reset secmark when scrubbing packet"
  bpf: fix two bugs in verification logic when accessing 'ctx' pointer
  bpf: fix bpf helpers to use skb-&gt;mac_header relative offsets
  stmmac: Configure Flow Control to work correctly based on rxfifo size
  stmmac: Enable unicast pause frame detect in GMAC Register 6
  stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
  stmmac: Add defines and documentation for enabling flow control
  stmmac: Add properties for transmit and receive fifo sizes
  stmmac: fix oops on rmmod after assigning ip addr
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>netns: remove BUG_ONs from net_generic()</title>
<updated>2015-04-17T19:21:48+00:00</updated>
<author>
<name>Denys Vlasenko</name>
<email>dvlasenk@redhat.com</email>
</author>
<published>2015-04-17T17:06:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2591ffd3085e34a231480efe8f1cac83ebb81725'/>
<id>2591ffd3085e34a231480efe8f1cac83ebb81725</id>
<content type='text'>
This inline has ~500 callsites.

On 04/14/2015 08:37 PM, David Miller wrote:
&gt; That BUG_ON() was added 7 years ago, and I don't remember it ever
&gt; triggering or helping us diagnose something, so just remove it and
&gt; keep the function inlined.

On x86 allyesconfig build:

    text     data      bss       dec     hex filename
82447071 22255384 20627456 125329911 77861f7 vmlinux4
82441375 22255384 20627456 125324215 7784bb7 vmlinux5prime

Signed-off-by: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
CC: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
CC: David S. Miller &lt;davem@davemloft.net&gt;
CC: Jan Engelhardt &lt;jengelh@medozas.de&gt;
CC: Jiri Pirko &lt;jpirko@redhat.com&gt;
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This inline has ~500 callsites.

On 04/14/2015 08:37 PM, David Miller wrote:
&gt; That BUG_ON() was added 7 years ago, and I don't remember it ever
&gt; triggering or helping us diagnose something, so just remove it and
&gt; keep the function inlined.

On x86 allyesconfig build:

    text     data      bss       dec     hex filename
82447071 22255384 20627456 125329911 77861f7 vmlinux4
82441375 22255384 20627456 125324215 7784bb7 vmlinux5prime

Signed-off-by: Denys Vlasenko &lt;dvlasenk@redhat.com&gt;
CC: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
CC: David S. Miller &lt;davem@davemloft.net&gt;
CC: Jan Engelhardt &lt;jengelh@medozas.de&gt;
CC: Jiri Pirko &lt;jpirko@redhat.com&gt;
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>inet_diag: fix access to tcp cc information</title>
<updated>2015-04-17T17:28:31+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-04-17T01:10:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=521f1cf1dbb9d5ad858dca5dc75d1b45f64b6589'/>
<id>521f1cf1dbb9d5ad858dca5dc75d1b45f64b6589</id>
<content type='text'>
Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
   icsk-&gt;icsk_ca_ops can change under us and module be unloaded.
   -&gt; Access to freed memory.
   Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
   but again this is not safe against icsk-&gt;icsk_ca_ops
   change and nla_put() errors were ignored. Some sockets
   could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Two different problems are fixed here :

1) inet_sk_diag_fill() might be called without socket lock held.
   icsk-&gt;icsk_ca_ops can change under us and module be unloaded.
   -&gt; Access to freed memory.
   Fix this using rcu_read_lock() to prevent module unload.

2) Some TCP Congestion Control modules provide information
   but again this is not safe against icsk-&gt;icsk_ca_ops
   change and nla_put() errors were ignored. Some sockets
   could not get the additional info if skb was almost full.

Fix this by returning a status from get_info() handlers and
using rcu protection as well.

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2015-04-15T20:22:56+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-04-15T20:22:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=fa927894bbb4a4c7669c72bad1924991022fda38'/>
<id>fa927894bbb4a4c7669c72bad1924991022fda38</id>
<content type='text'>
Pull second vfs update from Al Viro:
 "Now that net-next went in...  Here's the next big chunk - killing
  -&gt;aio_read() and -&gt;aio_write().

  There'll be one more pile today (direct_IO changes and
  generic_write_checks() cleanups/fixes), but I'd prefer to keep that
  one separate"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  -&gt;aio_read and -&gt;aio_write removed
  pcm: another weird API abuse
  infinibad: weird APIs switched to -&gt;write_iter()
  kill do_sync_read/do_sync_write
  fuse: use iov_iter_get_pages() for non-splice path
  fuse: switch to -&gt;read_iter/-&gt;write_iter
  switch drivers/char/mem.c to -&gt;read_iter/-&gt;write_iter
  make new_sync_{read,write}() static
  coredump: accept any write method
  switch /dev/loop to vfs_iter_write()
  serial2002: switch to __vfs_read/__vfs_write
  ashmem: use __vfs_read()
  export __vfs_read()
  autofs: switch to __vfs_write()
  new helper: __vfs_write()
  switch hugetlbfs to -&gt;read_iter()
  coda: switch to -&gt;read_iter/-&gt;write_iter
  ncpfs: switch to -&gt;read_iter/-&gt;write_iter
  net/9p: remove (now-)unused helpers
  p9_client_attach(): set fid-&gt;uid correctly
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull second vfs update from Al Viro:
 "Now that net-next went in...  Here's the next big chunk - killing
  -&gt;aio_read() and -&gt;aio_write().

  There'll be one more pile today (direct_IO changes and
  generic_write_checks() cleanups/fixes), but I'd prefer to keep that
  one separate"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  -&gt;aio_read and -&gt;aio_write removed
  pcm: another weird API abuse
  infinibad: weird APIs switched to -&gt;write_iter()
  kill do_sync_read/do_sync_write
  fuse: use iov_iter_get_pages() for non-splice path
  fuse: switch to -&gt;read_iter/-&gt;write_iter
  switch drivers/char/mem.c to -&gt;read_iter/-&gt;write_iter
  make new_sync_{read,write}() static
  coredump: accept any write method
  switch /dev/loop to vfs_iter_write()
  serial2002: switch to __vfs_read/__vfs_write
  ashmem: use __vfs_read()
  export __vfs_read()
  autofs: switch to __vfs_write()
  new helper: __vfs_write()
  switch hugetlbfs to -&gt;read_iter()
  coda: switch to -&gt;read_iter/-&gt;write_iter
  ncpfs: switch to -&gt;read_iter/-&gt;write_iter
  net/9p: remove (now-)unused helpers
  p9_client_attach(): set fid-&gt;uid correctly
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next</title>
<updated>2015-04-14T22:51:19+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-04-14T22:51:19+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bae97d84100ae7a8dc3b79233ecd3a8f7c19ea57'/>
<id>bae97d84100ae7a8dc3b79233ecd3a8f7c19ea57</id>
<content type='text'>
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

A final pull request, I know it's very late but this time I think it's worth a
bit of rush.

The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.

This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.

This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.

Patrick McHardy says:

====================
netfilter: nf_tables: concatenation support

The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).

The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.

Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.

Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.

As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.

The patches are split up as follows:

 * the first five patches add length validation to register loads and
   stores to make sure we stay within bounds and prepare the validation
   functions for the new addressing mode

 * the next patches prepare for changing to 32 bit addressing by
   introducing a struct nft_regs, which holds the verdict register as
   well as the data registers. The verdict members are moved to a new
   struct nft_verdict to allow to pull struct nft_data out of the stack.

 * the next patches contain preparatory conversions of expressions and
   sets to use 32 bit addressing

 * the next patch introduces so far unused register conversion helpers
   for parsing and dumping register numbers over netlink

 * following is the real conversion to 32 bit addressing, consisting of
   replacing struct nft_data in struct nft_regs by an array of u32s and
   actually translating and validating the new register numbers.

 * the final two patches add support for variable sized data items and
   variable sized keys / data in set elements

The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================

Patrick McHardy says:

====================
netfilter: nf_tables: dynamic stateful expression instantiation

The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.

Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.

In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.

We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.

The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be seperated from normal set listings
and use a more structured format:

1. Limit the rate of new SSH connections per host, similar to iptables
   hashlimit:

        flow ip saddr timeout 60s \
        limit 10/second \
        accept

2. Account network traffic between each set of /24 networks:

        flow ip saddr &amp; 255.255.255.0 . ip daddr &amp; 255.255.255.0 \
        counter

3. Account traffic to each host per user:

        flow skuid . ip daddr \
        counter

4. Account traffic for each combination of source address and TCP flags:

        flow ip saddr . tcp flags \
        counter

The resulting set content after a Xmas-scan look like this:

{
        192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
        192.168.122.1 . ack : counter packets 74 bytes 3848,
        192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select beween a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

A final pull request, I know it's very late but this time I think it's worth a
bit of rush.

The following patchset contains Netfilter/nf_tables updates for net-next, more
specifically concatenation support and dynamic stateful expression
instantiation.

This also comes with a couple of small patches. One to fix the ebtables.h
userspace header and another to get rid of an obsolete example file in tree
that describes a nf_tables expression.

This time, I decided to paste the original descriptions. This will result in a
rather large commit description, but I think these bytes to keep.

Patrick McHardy says:

====================
netfilter: nf_tables: concatenation support

The following patches add support for concatenations, which allow multi
dimensional exact matches in O(1).

The basic idea is to split the data registers, currently consisting of
4 registers of 16 bytes each, into smaller units, 16 registers of 4
bytes each, and making sure each register store always leaves the
full 32 bit in a well defined state, meaning smaller stores will
zero the remaining bits.

Based on that, we can load multiple adjacent registers with different
values, thereby building a concatenated bigger value, and use that
value for set lookups.

Sets are changed to use variable sized extensions for their key and
data values, removing the fixed limit of 16 bytes while saving memory
if less space is needed.

As a side effect, these patches will allow some nice optimizations in
the future, like using jhash2 in nft_hash, removing the masking in
nft_cmp_fast, optimized data comparison using 32 bit word size etc.
These are not done so far however.

The patches are split up as follows:

 * the first five patches add length validation to register loads and
   stores to make sure we stay within bounds and prepare the validation
   functions for the new addressing mode

 * the next patches prepare for changing to 32 bit addressing by
   introducing a struct nft_regs, which holds the verdict register as
   well as the data registers. The verdict members are moved to a new
   struct nft_verdict to allow to pull struct nft_data out of the stack.

 * the next patches contain preparatory conversions of expressions and
   sets to use 32 bit addressing

 * the next patch introduces so far unused register conversion helpers
   for parsing and dumping register numbers over netlink

 * following is the real conversion to 32 bit addressing, consisting of
   replacing struct nft_data in struct nft_regs by an array of u32s and
   actually translating and validating the new register numbers.

 * the final two patches add support for variable sized data items and
   variable sized keys / data in set elements

The patches have been verified to work correctly with nft binaries using
both old and new addressing.
====================

Patrick McHardy says:

====================
netfilter: nf_tables: dynamic stateful expression instantiation

The following patches are the grand finale of my nf_tables set work,
using all the building blocks put in place by the previous patches
to support something like iptables hashlimit, but a lot more powerful.

Sets are extended to allow attaching expressions to set elements.
The dynset expression dynamically instantiates these expressions
based on a template when creating new set elements and evaluates
them for all new or updated set members.

In combination with concatenations this effectively creates state
tables for arbitrary combinations of keys, using the existing
expression types to maintain that state. Regular set GC takes care
of purging expired states.

We currently support two different stateful expressions, counter
and limit. Using limit as a template we can express the functionality
of hashlimit, but completely unrestricted in the combination of keys.
Using counter we can perform accounting for arbitrary flows.

The following examples from patch 5/5 show some possibilities.
Userspace syntax is still WIP, especially the listing of state
tables will most likely be seperated from normal set listings
and use a more structured format:

1. Limit the rate of new SSH connections per host, similar to iptables
   hashlimit:

        flow ip saddr timeout 60s \
        limit 10/second \
        accept

2. Account network traffic between each set of /24 networks:

        flow ip saddr &amp; 255.255.255.0 . ip daddr &amp; 255.255.255.0 \
        counter

3. Account traffic to each host per user:

        flow skuid . ip daddr \
        counter

4. Account traffic for each combination of source address and TCP flags:

        flow ip saddr . tcp flags \
        counter

The resulting set content after a Xmas-scan look like this:

{
        192.168.122.1 . fin | psh | urg : counter packets 1001 bytes 40040,
        192.168.122.1 . ack : counter packets 74 bytes 3848,
        192.168.122.1 . psh | ack : counter packets 35 bytes 3144
}

In the future the "expressions attached to elements" will be extended
to also support user created non-stateful expressions to allow to
efficiently select beween a set of parameter sets, f.i. a set of log
statements with different prefixes based on the interface, which currently
require one rule each. This will most likely have to wait until the next
kernel version though.
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2015-04-13T22:18:05+00:00</updated>
<author>
<name>David S. Miller</name>
<email>davem@davemloft.net</email>
</author>
<published>2015-04-13T22:18:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6e8a9d9148b6dc2305fcaaf60550b81cbb6319c6'/>
<id>6e8a9d9148b6dc2305fcaaf60550b81cbb6319c6</id>
<content type='text'>
Al Viro says:

====================
netdev-related stuff in vfs.git

There are several commits sitting in vfs.git that probably ought to go in
via net-next.git.  First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the -&gt;sendmsg()
and -&gt;recvmsg() patches a while ago.  It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter.  The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO.  Next there's
afs and rxrpc stuff from dhowells.  And then there's sanitizing kernel_sendmsg
et.al.  + missing inlined helper for "how much data is left in msg-&gt;msg_iter" -
this stuff is used in e.g.  cifs stuff, but it belongs in net-next.

That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Al Viro says:

====================
netdev-related stuff in vfs.git

There are several commits sitting in vfs.git that probably ought to go in
via net-next.git.  First of all, there's merge with vfs.git#iocb - that's
Christoph's aio rework, which has triggered conflicts with the -&gt;sendmsg()
and -&gt;recvmsg() patches a while ago.  It's not so much Christoph's stuff
that ought to be in net-next, as (pretty simple) conflict resolution on merge.
The next chunk is switch to {compat_,}import_iovec/import_single_range - new
safer primitives for initializing iov_iter.  The primitives themselves come
from vfs/git#iov_iter (and they are used quite a lot in vfs part of queue),
conversion of net/socket.c syscalls belongs in net-next, IMO.  Next there's
afs and rxrpc stuff from dhowells.  And then there's sanitizing kernel_sendmsg
et.al.  + missing inlined helper for "how much data is left in msg-&gt;msg_iter" -
this stuff is used in e.g.  cifs stuff, but it belongs in net-next.

That pile is pullable from
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-davem

I'll post the individual patches in there in followups; could you take a look
and tell if everything in there is OK with you?
====================

Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tcp/dccp: get rid of central timewait timer</title>
<updated>2015-04-13T20:40:05+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2015-04-13T01:51:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=789f558cfb3680aeb52de137418637f6b04b7d22'/>
<id>789f558cfb3680aeb52de137418637f6b04b7d22</id>
<content type='text'>
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Using a timer wheel for timewait sockets was nice ~15 years ago when
memory was expensive and machines had a single processor.

This does not scale, code is ugly and source of huge latencies
(Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

We can afford to use an extra 64 bytes per timewait sock and spread
timewait load to all cpus to have better behavior.

Tested:

On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
on the target (lpaa24)

Before patch :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
419594

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
437171

While test is running, we can observe 25 or even 33 ms latencies.

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

After patch :

About 90% increase of throughput :

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
810442

lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
800992

And latencies are kept to minimal values during this load, even
if network utilization is 90% higher :

lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
...
1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>netfilter: nf_tables: mark stateful expressions</title>
<updated>2015-04-13T18:12:31+00:00</updated>
<author>
<name>Patrick McHardy</name>
<email>kaber@trash.net</email>
</author>
<published>2015-04-11T09:46:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=151d799a61da1b6f6b7e5116fb776177917bbe9a'/>
<id>151d799a61da1b6f6b7e5116fb776177917bbe9a</id>
<content type='text'>
Add a flag to mark stateful expressions.

This is used for dynamic expression instanstiation to limit the usable
expressions. Strictly speaking only the dynset expression can not be
used in order to avoid recursion, but since dynamically instantiating
non-stateful expressions will simply create an identical copy, which
behaves no differently than the original, this limits to expressions
where it actually makes sense to dynamically instantiate them.

Signed-off-by: Patrick McHardy &lt;kaber@trash.net&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add a flag to mark stateful expressions.

This is used for dynamic expression instanstiation to limit the usable
expressions. Strictly speaking only the dynset expression can not be
used in order to avoid recursion, but since dynamically instantiating
non-stateful expressions will simply create an identical copy, which
behaves no differently than the original, this limits to expressions
where it actually makes sense to dynamically instantiate them.

Signed-off-by: Patrick McHardy &lt;kaber@trash.net&gt;
Signed-off-by: Pablo Neira Ayuso &lt;pablo@netfilter.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
