linux-stable.git/include/linux, branch v4.4.9

numa: fix /proc//numa_maps for THP

2016-05-04T21:48:49+00:00

commit 28093f9f34cedeaea0f481c58446d9dac6dd620f upstream.

In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
because the layouts may differ depending on the architecture.  On s390
this will lead to inaccurate numa_maps accounting in /proc because of
misguided pte_present() and pte_dirty() checks on the fake pte.

On other architectures pte_present() and pte_dirty() may work by chance,
but there may be an issue with direct-access (dax) mappings w/o
underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
available.  In vm_normal_page() the fake pte will be checked with
pte_special() and because there is no "special" bit in a pmd, this will
always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
skipped.  On dax mappings w/o struct pages, an invalid struct page
pointer would then be returned that can crash the kernel.

This patch fixes the numa_maps THP handling by introducing new "_pmd"
variants of the can_gather_numa_stats() and vm_normal_page() functions.

Signed-off-by: Gerald Schaefer 
Cc: Naoya Horiguchi 
Cc: "Kirill A . Shutemov" 
Cc: Konstantin Khlebnikov 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Jerome Marchand 
Cc: Johannes Weiner 
Cc: Dave Hansen 
Cc: Mel Gorman 
Cc: Dan Williams 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: Michael Holzheu 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback

2016-05-04T21:48:49+00:00

commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream.

Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo 
Cc: Li Zefan 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Signed-off-by: Greg Kroah-Hartman

IB/mlx5: Expose correct max_sge_rd limit

2016-05-04T21:48:48+00:00

commit 986ef95ecdd3eb6fa29433e68faa94c7624083be upstream.

mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation
where rdma read work queue entries cannot exceed 512 bytes.
A rdma_read wqe needs to fit in 512 bytes:
- wqe control segment (16 bytes)
- rdma segment (16 bytes)
- scatter elements (16 bytes each)

So max_sge_rd should be: (512 - 16 - 16) / 16 = 30.

Reported-by: Christoph Hellwig 
Tested-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Doug Ledford 
Signed-off-by: Greg Kroah-Hartman

Revert "PCI: Add helpers to manage pci_dev->irq and pci_dev->irq_managed"

2016-04-20T06:42:16+00:00

commit 67b4eab91caf2ad574cab1b17ae09180ea2e116e upstream.

Revert 811a4e6fce09 ("PCI: Add helpers to manage pci_dev->irq and
pci_dev->irq_managed").

This is part of reverting 991de2e59090 ("PCI, x86: Implement
pcibios_alloc_irq() and pcibios_free_irq()") to fix regressions it
introduced.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211
Fixes: 991de2e59090 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()")
Signed-off-by: Bjorn Helgaas 
Acked-by: Rafael J. Wysocki 
CC: Jiang Liu 
Signed-off-by: Greg Kroah-Hartman

fs: add file_dentry()

2016-04-20T06:42:12+00:00

commit d101a125954eae1d397adda94ca6319485a50493 upstream.

This series fixes bugs in nfs and ext4 due to 4bacc9c9234c ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").

Regular files opened on overlayfs will result in the file being opened on
the underlying filesystem, while f_path points to the overlayfs
mount/dentry.

This confuses filesystems which get the dentry from struct file and assume
it's theirs.

Add a new helper, file_dentry() [*], to get the filesystem's own dentry
from the file.  This checks file->f_path.dentry->d_flags against
DCACHE_OP_REAL, and returns file->f_path.dentry if DCACHE_OP_REAL is not
set (this is the common, non-overlayfs case).

In the uncommon case it will call into overlayfs's ->d_real() to get the
underlying dentry, matching file_inode(file).

The reason we need to check against the inode is that if the file is copied
up while being open, d_real() would return the upper dentry, while the open
file comes from the lower dentry.

[*] If possible, it's better simply to use file_inode() instead.

Signed-off-by: Miklos Szeredi 
Signed-off-by: Theodore Ts'o 
Tested-by: Goldwyn Rodrigues 
Reviewed-by: Trond Myklebust 
Cc: David Howells 
Cc: Al Viro 
Cc: Daniel Axtens 
Signed-off-by: Greg Kroah-Hartman

USB: uas: Add a new NO_REPORT_LUNS quirk

2016-04-20T06:42:07+00:00

commit 1363074667a6b7d0507527742ccd7bbed5e3ceaa upstream.

Add a new NO_REPORT_LUNS quirk and set it for Seagate drives with
an usb-id of: 0bc2:331a, as these will fail to respond to a
REPORT_LUNS command.

Reported-and-tested-by: David Webb 
Signed-off-by: Hans de Goede 
Acked-by: Alan Stern 
Signed-off-by: Greg Kroah-Hartman

tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter

2016-04-20T06:42:06+00:00

[ Upstream commit 5a5abb1fa3b05dd6aa821525832644c1e7d2905f ]

Sasha Levin reported a suspicious rcu_dereference_protected() warning
found while fuzzing with trinity that is similar to this one:

  [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
  [   52.765688] other info that might help us debug this:
  [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
  [   52.765701] 1 lock held by a.out/1525:
  [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
  [   52.765721] stack backtrace:
  [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
  [...]
  [   52.765768] Call Trace:
  [   52.765775]  [] dump_stack+0x85/0xc8
  [   52.765784]  [] lockdep_rcu_suspicious+0xd5/0x110
  [   52.765792]  [] sk_detach_filter+0x82/0x90
  [   52.765801]  [] tun_detach_filter+0x35/0x90 [tun]
  [   52.765810]  [] __tun_chr_ioctl+0x354/0x1130 [tun]
  [   52.765818]  [] ? selinux_file_ioctl+0x130/0x210
  [   52.765827]  [] tun_chr_ioctl+0x13/0x20 [tun]
  [   52.765834]  [] do_vfs_ioctl+0x96/0x690
  [   52.765843]  [] ? security_file_ioctl+0x43/0x60
  [   52.765850]  [] SyS_ioctl+0x79/0x90
  [   52.765858]  [] do_syscall_64+0x62/0x140
  [   52.765866]  [] entry_SYSCALL64_slow_path+0x25/0x25

Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
filter with rcu_dereference_protected(), checking whether socket lock
is held in control path.

Since its introduction in 994051625981 ("tun: socket filter support"),
tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
sock_owned_by_user(sk) doesn't apply in this specific case and therefore
triggers the false positive.

Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
that is used by tap filters and pass in lockdep_rtnl_is_held() for the
rcu_dereference_protected() checks instead.

Reported-by: Sasha Levin 
Signed-off-by: Daniel Borkmann 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

bridge: allow zero ageing time

2016-04-20T06:42:02+00:00

[ Upstream commit 4c656c13b254d598e83e586b7b4d36a2043dad85 ]

This fixes a regression in the bridge ageing time caused by:
commit c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev")

There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
  https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge

For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.

Signed-off-by: Stephen Hemminger 
Acked-by: Jiri Pirko 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: validate variable length ll headers

2016-04-20T06:42:00+00:00

[ Upstream commit 2793a23aacbd754dbbb5cb75093deb7e4103bace ]

Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.

Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.

Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.

See also http://comments.gmane.org/gmane.linux.network/401064

Signed-off-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

mld, igmp: Fix reserved tailroom calculation

2016-04-20T06:41:58+00:00

[ Upstream commit 1837b2e2bcd23137766555a63867e649c0b637f0 ]

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier 
Acked-by: Hannes Frederic Sowa 
Acked-by: Daniel Borkmann 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman