linux-stable.git/include/linux, branch linux-3.10.y

workqueue: implicit ordered attribute should be overridable

2017-11-02T09:45:58+00:00

commit 0a94efb5acbb6980d7c9ab604372d93cd507e4d8 upstream.

5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
ordered") automatically enabled ordered attribute for unbound
workqueues w/ max_active == 1.  Because ordered workqueues reject
max_active and some attribute changes, this implicit ordered mode
broke cases where the user creates an unbound workqueue w/ max_active
== 1 and later explicitly changes the related attributes.

This patch distinguishes explicit and implicit ordered setting and
overrides from attribute changes if implict.

Signed-off-by: Tejun Heo 
Fixes: 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
Cc: Holger Hoffstätte 
Signed-off-by: Greg Kroah-Hartman 

Signed-off-by: Willy Tarreau

KEYS: prevent creating a different user's keyrings

2017-11-02T06:16:18+00:00

commit 237bbd29f7a049d310d907f4b2716a7feef9abf3 upstream.

It was possible for an unprivileged user to create the user and user
session keyrings for another user.  For example:

    sudo -u '#3000' sh -c 'keyctl add keyring _uid.4000 "" @u
                           keyctl add keyring _uid_ses.4000 "" @u
                           sleep 15' &
    sleep 1
    sudo -u '#4000' keyctl describe @u
    sudo -u '#4000' keyctl describe @us

This is problematic because these "fake" keyrings won't have the right
permissions.  In particular, the user who created them first will own
them and will have full access to them via the possessor permissions,
which can be used to compromise the security of a user's keys:

    -4: alswrv-----v------------  3000     0 keyring: _uid.4000
    -5: alswrv-----v------------  3000     0 keyring: _uid_ses.4000

Fix it by marking user and user session keyrings with a flag
KEY_FLAG_UID_KEYRING.  Then, when searching for a user or user session
keyring by name, skip all keyrings that don't have the flag set.

Fixes: 69664cf16af4 ("keys: don't generate user and user session keyrings unless they're accessed")
Cc: 	[v2.6.26+]
Signed-off-by: Eric Biggers 
Signed-off-by: David Howells 
[wt: adjust context]

Signed-off-by: Willy Tarreau

mm: larger stack guard gap, between vmas

2017-06-21T13:42:43+00:00

commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov 
Original-patch-by: Michal Hocko 
Signed-off-by: Hugh Dickins 
[wt: backport to 4.11: adjust context]
[wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
[wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
[wt: backport to 3.18: adjust context ; no FOLL_POPULATE ;
     s390 uses generic arch_get_unmapped_area()]
[wt: backport to 3.16: adjust context]
[wt: backport to 3.10: adjust context ; code logic in PARISC's
     arch_get_unmapped_area() wasn't found ; code inserted into
     expand_upwards() and expand_downwards() runs under anon_vma lock;
     changes for gup.c:faultin_page go to memory.c:__get_user_pages();
     included Hugh Dickins' fixes]
Signed-off-by: Willy Tarreau

give up on gcc ilog2() constant optimizations

2017-06-20T12:04:32+00:00

commit 474c90156c8dcc2fa815e6716cc9394d7930cb9c upstream.

gcc-7 has an "optimization" pass that completely screws up, and
generates the code expansion for the (impossible) case of calling
ilog2() with a zero constant, even when the code gcc compiles does not
actually have a zero constant.

And we try to generate a compile-time error for anybody doing ilog2() on
a constant where that doesn't make sense (be it zero or negative).  So
now gcc7 will fail the build due to our sanity checking, because it
created that constant-zero case that didn't actually exist in the source
code.

There's a whole long discussion on the kernel mailing about how to work
around this gcc bug.  The gcc people themselevs have discussed their
"feature" in

   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=72785

but it's all water under the bridge, because while it looked at one
point like it would be solved by the time gcc7 was released, that was
not to be.

So now we have to deal with this compiler braindamage.

And the only simple approach seems to be to just delete the code that
tries to warn about bad uses of ilog2().

So now "ilog2()" will just return 0 not just for the value 1, but for
any non-positive value too.

It's not like I can recall anybody having ever actually tried to use
this function on any invalid value, but maybe the sanity check just
meant that such code never made it out in public.

[js] no tools/include/linux/log2.h copy of that yet

Reported-by: Laura Abbott 
Cc: John Stultz ,
Cc: Thomas Gleixner 
Cc: Ard Biesheuvel 
Signed-off-by: Linus Torvalds 
Signed-off-by: Jiri Slaby 
Signed-off-by: Willy Tarreau

nfs: Don't increment lock sequence ID after NFS4ERR_MOVED

2017-06-20T12:04:16+00:00

commit 059aa734824165507c65fd30a55ff000afd14983 upstream.

Xuan Qi reports that the Linux NFSv4 client failed to lock a file
that was migrated. The steps he observed on the wire:

1. The client sent a LOCK request to the source server
2. The source server replied NFS4ERR_MOVED
3. The client switched to the destination server
4. The client sent the same LOCK request to the destination
   server with a bumped lock sequence ID
5. The destination server rejected the LOCK request with
   NFS4ERR_BAD_SEQID

RFC 3530 section 8.1.5 provides a list of NFS errors which do not
bump a lock sequence ID.

However, RFC 3530 is now obsoleted by RFC 7530. In RFC 7530 section
9.1.7, this list has been updated by the addition of NFS4ERR_MOVED.

Reported-by: Xuan Qi 
Signed-off-by: Chuck Lever 
Signed-off-by: Trond Myklebust 
Signed-off-by: Willy Tarreau

cred/userns: define current_user_ns() as a function

2017-06-20T12:03:25+00:00

commit 0335695dfa4df01edff5bb102b9a82a0668ee51e upstream.

The current_user_ns() macro currently returns &init_user_ns when user
namespaces are disabled, and that causes several warnings when building
with gcc-6.0 in code that compares the result of the macro to
&init_user_ns itself:

  fs/xfs/xfs_ioctl.c: In function 'xfs_ioctl_setattr_check_projid':
  fs/xfs/xfs_ioctl.c:1249:22: error: self-comparison always evaluates to true [-Werror=tautological-compare]
    if (current_user_ns() == &init_user_ns)

This is a legitimate warning in principle, but here it isn't really
helpful, so I'm reprasing the definition in a way that shuts up the
warning.  Apparently gcc only warns when comparing identical literals,
but it can figure out that the result of an inline function can be
identical to a constant expression in order to optimize a condition yet
not warn about the fact that the condition is known at compile time.
This is exactly what we want here, and it looks reasonable because we
generally prefer inline functions over macros anyway.

Signed-off-by: Arnd Bergmann 
Acked-by: Serge Hallyn 
Cc: David Howells 
Cc: Yaowei Bai 
Cc: James Morris 
Cc: "Paul E. McKenney" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Willy Tarreau

KVM: kvm_io_bus_unregister_dev() should never fail

2017-06-07T22:47:11+00:00

commit 90db10434b163e46da413d34db8d0e77404cc645 upstream.

No caller currently checks the return value of
kvm_io_bus_unregister_dev(). This is evil, as all callers silently go on
freeing their device. A stale reference will remain in the io_bus,
getting at least used again, when the iobus gets teared down on
kvm_destroy_vm() - leading to use after free errors.

There is nothing the callers could do, except retrying over and over
again.

So let's simply remove the bus altogether, print an error and make
sure no one can access this broken bus again (returning -ENOMEM on any
attempt to access it).

Fixes: e93f8a0f821e ("KVM: convert io_bus to SRCU")
Reported-by: Dmitry Vyukov 
Reviewed-by: Cornelia Huck 
Signed-off-by: David Hildenbrand 
Signed-off-by: Paolo Bonzini 
[wt: no kvm_io_bus_read_cookie in 3.10, slightly different constructs]

Signed-off-by: Willy Tarreau

kvm: exclude ioeventfd from counting kvm_io_range limit

2017-06-07T22:47:11+00:00

commit 6ea34c9b78c10289846db0abeebd6b84d5aca084 upstream.

We can easily reach the 1000 limit by start VM with a couple
hundred I/O devices (multifunction=on). The hardcode limit
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

In userspace, we already have maximum file descriptor to
limit ioeventfd count. But kvm_io_bus devices also are used
for pit, pic, ioapic, coalesced_mmio. They couldn't be limited
by maximum file descriptor.

Currently only ioeventfds take too much kvm_io_bus devices,
so just exclude it from counting kvm_io_range limit.

Also fixed one indent issue in kvm_host.h

Signed-off-by: Amos Kong 
Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Gleb Natapov 
[wt: next patch depends on this one]
Signed-off-by: Willy Tarreau

can: Fix kernel panic at security_sock_rcv_skb

2017-06-07T22:47:09+00:00

commit f1712c73714088a7252d276a57126d56c7d37e64 upstream.

Zhang Yanmin reported crashes [1] and provided a patch adding a
synchronize_rcu() call in can_rx_unregister()

The main problem seems that the sockets themselves are not RCU
protected.

If CAN uses RCU for delivery, then sockets should be freed only after
one RCU grace period.

Recent kernels could use sock_set_flag(sk, SOCK_RCU_FREE), but let's
ease stable backports with the following fix instead.

[1]
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] selinux_socket_sock_rcv_skb+0x65/0x2a0

Call Trace:
 
 [] security_sock_rcv_skb+0x4c/0x60
 [] sk_filter+0x41/0x210
 [] sock_queue_rcv_skb+0x53/0x3a0
 [] raw_rcv+0x2a3/0x3c0
 [] can_rcv_filter+0x12b/0x370
 [] can_receive+0xd9/0x120
 [] can_rcv+0xab/0x100
 [] __netif_receive_skb_core+0xd8c/0x11f0
 [] __netif_receive_skb+0x24/0xb0
 [] process_backlog+0x127/0x280
 [] net_rx_action+0x33b/0x4f0
 [] __do_softirq+0x184/0x440
 [] do_softirq_own_stack+0x1c/0x30
 
 [] do_softirq.part.18+0x3b/0x40
 [] do_softirq+0x1d/0x20
 [] netif_rx_ni+0xe5/0x110
 [] slcan_receive_buf+0x507/0x520
 [] flush_to_ldisc+0x21c/0x230
 [] process_one_work+0x24f/0x670
 [] worker_thread+0x9d/0x6f0
 [] ? rescuer_thread+0x480/0x480
 [] kthread+0x12c/0x150
 [] ret_from_fork+0x3f/0x70

Reported-by: Zhang Yanmin 
Signed-off-by: Eric Dumazet 
Acked-by: Oliver Hartkopp 
Signed-off-by: David S. Miller 
Signed-off-by: Willy Tarreau

ACPI / PNP: Reserve ACPI resources at the fs_initcall_sync stage

2017-06-07T22:47:06+00:00

commit 0294112ee3135fbd15eaa70015af8283642dd970 upstream.

This effectively reverts the following three commits:

 7bc10388ccdd ACPI / resources: free memory on error in add_region_before()
 0f1b414d1907 ACPI / PNP: Avoid conflicting resource reservations
 b9a5e5e18fbf ACPI / init: Fix the ordering of acpi_reserve_resources()

(commit b9a5e5e18fbf introduced regressions some of which, but not
all, were addressed by commit 0f1b414d1907 and commit 7bc10388ccdd
was a fixup on top of the latter) and causes ACPI fixed hardware
resources to be reserved at the fs_initcall_sync stage of system
initialization.

The story is as follows.  First, a boot regression was reported due
to an apparent resource reservation ordering change after a commit
that shouldn't lead to such changes.  Investigation led to the
conclusion that the problem happened because acpi_reserve_resources()
was executed at the device_initcall() stage of system initialization
which wasn't strictly ordered with respect to driver initialization
(and with respect to the initialization of the pcieport driver in
particular), so a random change causing the device initcalls to be
run in a different order might break things.

The response to that was to attempt to run acpi_reserve_resources()
as soon as we knew that ACPI would be in use (commit b9a5e5e18fbf).
However, that turned out to be too early, because it caused resource
reservations made by the PNP system driver to fail on at least one
system and that failure was addressed by commit 0f1b414d1907.

That fix still turned out to be insufficient, though, because
calling acpi_reserve_resources() before the fs_initcall stage of
system initialization caused a boot regression to happen on the
eCAFE EC-800-H20G/S netbook.  That meant that we only could call
acpi_reserve_resources() at the fs_initcall initialization stage
or later, but then we might just as well call it after the PNP
initalization in which case commit 0f1b414d1907 wouldn't be
necessary any more.

For this reason, the changes made by commit 0f1b414d1907 are reverted
(along with a memory leak fixup on top of that commit), the changes
made by commit b9a5e5e18fbf that went too far are reverted too and
acpi_reserve_resources() is changed into fs_initcall_sync, which
will cause it to be executed after the PNP subsystem initialization
(which is an fs_initcall) and before device initcalls (including
the pcieport driver initialization) which should avoid the initial
issue.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=100581
Link: http://marc.info/?t=143092384600002&r=1&w=2
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99831
Link: http://marc.info/?t=143389402600001&r=1&w=2
Fixes: b9a5e5e18fbf "ACPI / init: Fix the ordering of acpi_reserve_resources()"
Reported-by: Roland Dreier 
Signed-off-by: Rafael J. Wysocki 
Signed-off-by: Willy Tarreau