summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2019-02-06of: overlay: add tests to validate kfrees from overlay removalFrank Rowand
commit 144552c786925314c1e7cb8f91a71dae1aca8798 upstream. Add checks: - attempted kfree due to refcount reaching zero before overlay is removed - properties linked to an overlay node when the node is removed - node refcount > one during node removal in a changeset destroy, if the node was created by the changeset After applying this patch, several validation warnings will be reported from the devicetree unittest during boot due to pre-existing devicetree bugs. The warnings will be similar to: OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11 OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/ hvac-medium-2 Tested-by: Alan Tull <atull@kernel.org> Signed-off-by: Frank Rowand <frank.rowand@sony.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06oom, oom_reaper: do not enqueue same task twiceTetsuo Handa
commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream. Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b018325737f76 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-06ipvlan, l3mdev: fix broken l3s mode wrt local routesDaniel Borkmann
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ] While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31Drivers: hv: vmbus: Remove the useless API vmbus_get_outgoing_channel()Dexuan Cui
[ Upstream commit 4d3c5c69191f98c7f7e699ff08d2fd96d7070ddb ] Commit d86adf482b84 ("scsi: storvsc: Enable multi-queue support") removed the usage of the API in Jan 2017, and the API is not used since then. netvsc and storvsc have their own algorithms to determine the outgoing channel, so this API is useless. And the API is potentially unsafe, because it reads primary->num_sc without any lock held. This can be risky considering the RESCIND-OFFER message. Let's remove the API. Cc: Long Li <longli@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31bpf: fix sanitation of alu op with pointer / scalar type from different pathsDaniel Borkmann
[ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ] While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") took care of rejecting alu op on pointer when e.g. pointer came from two different map values with different map properties such as value size, Jann reported that a case was not covered yet when a given alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from different branches where we would incorrectly try to sanitize based on the pointer's limit. Catch this corner case and reject the program instead. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31bpf: prevent out of bounds speculation on pointer arithmeticDaniel Borkmann
[ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ] Jann reported that the original commit back in b2157399cc98 ("bpf: prevent out-of-bounds speculation") was not sufficient to stop CPU from speculating out of bounds memory access: While b2157399cc98 only focussed on masking array map access for unprivileged users for tail calls and data access such that the user provided index gets sanitized from BPF program and syscall side, there is still a more generic form affected from BPF programs that applies to most maps that hold user data in relation to dynamic map access when dealing with unknown scalars or "slow" known scalars as access offset, for example: - Load a map value pointer into R6 - Load an index into R7 - Do a slow computation (e.g. with a memory dependency) that loads a limit into R8 (e.g. load the limit from a map for high latency, then mask it to make the verifier happy) - Exit if R7 >= R8 (mispredicted branch) - Load R0 = R6[R7] - Load R0 = R6[R0] For unknown scalars there are two options in the BPF verifier where we could derive knowledge from in order to guarantee safe access to the memory: i) While </>/<=/>= variants won't allow to derive any lower or upper bounds from the unknown scalar where it would be safe to add it to the map value pointer, it is possible through ==/!= test however. ii) another option is to transform the unknown scalar into a known scalar, for example, through ALU ops combination such as R &= <imm> followed by R |= <imm> or any similar combination where the original information from the unknown scalar would be destroyed entirely leaving R with a constant. The initial slow load still precedes the latter ALU ops on that register, so the CPU executes speculatively from that point. Once we have the known scalar, any compare operation would work then. A third option only involving registers with known scalars could be crafted as described in [0] where a CPU port (e.g. Slow Int unit) would be filled with many dependent computations such that the subsequent condition depending on its outcome has to wait for evaluation on its execution port and thereby executing speculatively if the speculated code can be scheduled on a different execution port, or any other form of mistraining as described in [1], for example. Given this is not limited to only unknown scalars, not only map but also stack access is affected since both is accessible for unprivileged users and could potentially be used for out of bounds access under speculation. In order to prevent any of these cases, the verifier is now sanitizing pointer arithmetic on the offset such that any out of bounds speculation would be masked in a way where the pointer arithmetic result in the destination register will stay unchanged, meaning offset masked into zero similar as in array_index_nospec() case. With regards to implementation, there are three options that were considered: i) new insn for sanitation, ii) push/pop insn and sanitation as inlined BPF, iii) reuse of ax register and sanitation as inlined BPF. Option i) has the downside that we end up using from reserved bits in the opcode space, but also that we would require each JIT to emit masking as native arch opcodes meaning mitigation would have slow adoption till everyone implements it eventually which is counter-productive. Option ii) and iii) have both in common that a temporary register is needed in order to implement the sanitation as inlined BPF since we are not allowed to modify the source register. While a push / pop insn in ii) would be useful to have in any case, it requires once again that every JIT needs to implement it first. While possible, amount of changes needed would also be unsuitable for a -stable patch. Therefore, the path which has fewer changes, less BPF instructions for the mitigation and does not require anything to be changed in the JITs is option iii) which this work is pursuing. The ax register is already mapped to a register in all JITs (modulo arm32 where it's mapped to stack as various other BPF registers there) and used in constant blinding for JITs-only so far. It can be reused for verifier rewrites under certain constraints. The interpreter's tmp "register" has therefore been remapped into extending the register set with hidden ax register and reusing that for a number of instructions that needed the prior temporary variable internally (e.g. div, mod). This allows for zero increase in stack space usage in the interpreter, and enables (restricted) generic use in rewrites otherwise as long as such a patchlet does not make use of these instructions. The sanitation mask is dynamic and relative to the offset the map value or stack pointer currently holds. There are various cases that need to be taken under consideration for the masking, e.g. such operation could look as follows: ptr += val or val += ptr or ptr -= val. Thus, the value to be sanitized could reside either in source or in destination register, and the limit is different depending on whether the ALU op is addition or subtraction and depending on the current known and bounded offset. The limit is derived as follows: limit := max_value_size - (smin_value + off). For subtraction: limit := umax_value + off. This holds because we do not allow any pointer arithmetic that would temporarily go out of bounds or would have an unknown value with mixed signed bounds where it is unclear at verification time whether the actual runtime value would be either negative or positive. For example, we have a derived map pointer value with constant offset and bounded one, so limit based on smin_value works because the verifier requires that statically analyzed arithmetic on the pointer must be in bounds, and thus it checks if resulting smin_value + off and umax_value + off is still within map value bounds at time of arithmetic in addition to time of access. Similarly, for the case of stack access we derive the limit as follows: MAX_BPF_STACK + off for subtraction and -off for the case of addition where off := ptr_reg->off + ptr_reg->var_off.value. Subtraction is a special case for the masking which can be in form of ptr += -val, ptr -= -val, or ptr -= val. In the first two cases where we know that the value is negative, we need to temporarily negate the value in order to do the sanitation on a positive value where we later swap the ALU op, and restore original source register if the value was in source. The sanitation of pointer arithmetic alone is still not fully sufficient as is, since a scenario like the following could happen ... PTR += 0x1000 (e.g. K-based imm) PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON PTR += 0x1000 PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON [...] ... which under speculation could end up as ... PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] [...] ... and therefore still access out of bounds. To prevent such case, the verifier is also analyzing safety for potential out of bounds access under speculative execution. Meaning, it is also simulating pointer access under truncation. We therefore "branch off" and push the current verification state after the ALU operation with known 0 to the verification stack for later analysis. Given the current path analysis succeeded it is likely that the one under speculation can be pruned. In any case, it is also subject to existing complexity limits and therefore anything beyond this point will be rejected. In terms of pruning, it needs to be ensured that the verification state from speculative execution simulation must never prune a non-speculative execution path, therefore, we mark verifier state accordingly at the time of push_stack(). If verifier detects out of bounds access under speculative execution from one of the possible paths that includes a truncation, it will reject such program. Given we mask every reg-based pointer arithmetic for unprivileged programs, we've been looking into how it could affect real-world programs in terms of size increase. As the majority of programs are targeted for privileged-only use case, we've unconditionally enabled masking (with its alu restrictions on top of it) for privileged programs for the sake of testing in order to check i) whether they get rejected in its current form, and ii) by how much the number of instructions and size will increase. We've tested this by using Katran, Cilium and test_l4lb from the kernel selftests. For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb we've used test_l4lb.o as well as test_l4lb_noinline.o. We found that none of the programs got rejected by the verifier with this change, and that impact is rather minimal to none. balancer_kern.o had 13,904 bytes (1,738 insns) xlated and 7,797 bytes JITed before and after the change. Most complex program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated and 18,538 bytes JITed before and after and none of the other tail call programs in bpf_lxc.o had any changes either. For the older bpf_lxc_opt_-DUNKNOWN.o object we found a small increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed after the change. Other programs from that object file had similar small increase. Both test_l4lb.o had no change and remained at 6,544 bytes (817 insns) xlated and 3,401 bytes JITed and for test_l4lb_noinline.o constant at 5,080 bytes (634 insns) xlated and 3,313 bytes JITed. This can be explained in that LLVM typically optimizes stack based pointer arithmetic by using K-based operations and that use of dynamic map access is not overly frequent. However, in future we may decide to optimize the algorithm further under known guarantees from branch and value speculation. Latter seems also unclear in terms of prediction heuristics that today's CPUs apply as well as whether there could be collisions in e.g. the predictor's Value History/Pattern Table for triggering out of bounds access, thus masking is performed unconditionally at this point but could be subject to relaxation later on. We were generally also brainstorming various other approaches for mitigation, but the blocker was always lack of available registers at runtime and/or overhead for runtime tracking of limits belonging to a specific pointer. Thus, we found this to be minimally intrusive under given constraints. With that in place, a simple example with sanitized access on unprivileged load at post-verification time looks as follows: # bpftool prog dump xlated id 282 [...] 28: (79) r1 = *(u64 *)(r7 +0) 29: (79) r2 = *(u64 *)(r7 +8) 30: (57) r1 &= 15 31: (79) r3 = *(u64 *)(r0 +4608) 32: (57) r3 &= 1 33: (47) r3 |= 1 34: (2d) if r2 > r3 goto pc+19 35: (b4) (u32) r11 = (u32) 20479 | 36: (1f) r11 -= r2 | Dynamic sanitation for pointer 37: (4f) r11 |= r2 | arithmetic with registers 38: (87) r11 = -r11 | containing bounded or known 39: (c7) r11 s>>= 63 | scalars in order to prevent 40: (5f) r11 &= r2 | out of bounds speculation. 41: (0f) r4 += r11 | 42: (71) r4 = *(u8 *)(r4 +0) 43: (6f) r4 <<= r1 [...] For the case where the scalar sits in the destination register as opposed to the source register, the following code is emitted for the above example: [...] 16: (b4) (u32) r11 = (u32) 20479 17: (1f) r11 -= r2 18: (4f) r11 |= r2 19: (87) r11 = -r11 20: (c7) r11 s>>= 63 21: (5f) r2 &= r11 22: (0f) r2 += r0 23: (61) r0 = *(u32 *)(r2 +0) [...] JIT blinding example with non-conflicting use of r10: [...] d5: je 0x0000000000000106 _ d7: mov 0x0(%rax),%edi | da: mov $0xf153246,%r10d | Index load from map value and e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f. e7: and %r10,%rdi |_ ea: mov $0x2f,%r10d | f0: sub %rdi,%r10 | Sanitized addition. Both use r10 f3: or %rdi,%r10 | but do not interfere with each f6: neg %r10 | other. (Neither do these instructions f9: sar $0x3f,%r10 | interfere with the use of ax as temp fd: and %r10,%rdi | in interpreter.) 100: add %rax,%rdi |_ 103: mov 0x0(%rdi),%eax [...] Tested that it fixes Jann's reproducer, and also checked that test_verifier and test_progs suite with interpreter, JIT and JIT with hardening enabled on x86-64 and arm64 runs successfully. [0] Speculose: Analyzing the Security Implications of Speculative Execution in CPUs, Giorgi Maisuradze and Christian Rossow, https://arxiv.org/pdf/1801.04084.pdf [1] A Systematic Evaluation of Transient Execution Attacks and Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, Daniel Gruss, https://arxiv.org/pdf/1811.05441.pdf Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31bpf: enable access to ax register also from verifier rewriteDaniel Borkmann
[ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ] Right now we are using BPF ax register in JIT for constant blinding as well as in interpreter as temporary variable. Verifier will not be able to use it simply because its use will get overridden from the former in bpf_jit_blind_insn(). However, it can be made to work in that blinding will be skipped if there is prior use in either source or destination register on the instruction. Taking constraints of ax into account, the verifier is then open to use it in rewrites under some constraints. Note, ax register already has mappings in every eBPF JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31bpf: move tmp variable into ax register in interpreterDaniel Borkmann
[ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ] This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run() into the hidden ax register. The latter is currently only used in JITs for constant blinding as a temporary scratch register, meaning the BPF interpreter will never see the use of ax. Therefore it is safe to use it for the cases where tmp has been used earlier. This is needed to later on allow restricted hidden use of ax in both interpreter and JITs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31bpf: move {prev_,}insn_idx into verifier envDaniel Borkmann
[ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ] Move prev_insn_idx and insn_idx from the do_check() function into the verifier environment, so they can be read inside the various helper functions for handling the instructions. It's easier to put this into the environment rather than changing all call-sites only to pass it along. insn_idx is useful in particular since this later on allows to hold state in env->insn_aux_data[env->insn_idx]. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-31Drivers: hv: vmbus: Check for ring when getting debug infoDexuan Cui
commit ba50bf1ce9a51fc97db58b96d01306aa70bc3979 upstream. fc96df16a1ce is good and can already fix the "return stack garbage" issue, but let's also improve hv_ringbuffer_get_debuginfo(), which would silently return stack garbage, if people forget to check channel->state or ring_info->ring_buffer, when using the function in the future. Having an error check in the function would eliminate the potential risk. Add a Fixes tag to indicate the patch depdendency. Fixes: fc96df16a1ce ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels") Cc: stable@vger.kernel.org Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31net: phy: phy driver features are mandatoryCamelia Groza
[ Upstream commit 3e64cf7a435ed0500e3adaa8aada2272d3ae8abc ] Since phy driver features became a link_mode bitmap, phy drivers that don't have a list of features configured will cause the kernel to crash when probed. Prevent the phy driver from registering if the features field is missing. Fixes: 719655a14971 ("net: phy: Replace phy driver features u32 with link_mode bitmap") Reported-by: Scott Wood <oss@buserror.net> Signed-off-by: Camelia Groza <camelia.groza@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31net: Fix usage of pskb_trim_rcsumRoss Lagerwall
[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ] In certain cases, pskb_trim_rcsum() may change skb pointers. Reinitialize header pointers afterwards to avoid potential use-after-frees. Add a note in the documentation of pskb_trim_rcsum(). Found by KASAN. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-26mm/memblock.c: skip kmemleak for kasan_init()Qian Cai
[ Upstream commit fed84c78527009d4f799a3ed9a566502fa026d82 ] Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26mm/swap: use nr_node_ids for avail_lists in swap_info_structAaron Lu
[ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ] Since a2468cc9bfdf ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26bpf: Allow narrow loads with offset > 0Andrey Ignatov
[ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ] Currently BPF verifier allows narrow loads for a context field only with offset zero. E.g. if there is a __u32 field then only the following loads are permitted: * off=0, size=1 (narrow); * off=0, size=2 (narrow); * off=0, size=4 (full). On the other hand LLVM can generate a load with offset different than zero that make sense from program logic point of view, but verifier doesn't accept it. E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code: #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */ ... if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) && where ctx is struct bpf_sock_addr. Some versions of LLVM can produce the following byte code for it: 8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7) 9: 67 02 00 00 18 00 00 00 r2 <<= 24 10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll 12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6> where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently rejected by verifier. Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok() what means any is_valid_access implementation, that uses the function, works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access() for bpf_sock_addr. The patch makes such loads supported. Offset can be in [0; size_default) but has to be multiple of load size. E.g. for __u32 field the following loads are supported now: * off=0, size=1 (narrow); * off=1, size=1 (narrow); * off=2, size=1 (narrow); * off=3, size=1 (narrow); * off=0, size=2 (narrow); * off=2, size=2 (narrow); * off=0, size=4 (full). Reported-by: Yonghong Song <yhs@fb.com> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26writeback: don't decrement wb->refcnt if !wb->bdiAnders Roxell
[ Upstream commit 347a28b586802d09604a149c1a1f6de5dccbe6fa ] This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet. Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers. Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-26usb: typec: tcpm: Do not disconnect link for self powered devicesBadhri Jagan Sridharan
[ Upstream commit 23b5f73266e59a598c1e5dd435d87651b5a7626b ] During HARD_RESET the data link is disconnected. For self powered device, the spec is advising against doing that. >From USB_PD_R3_0 7.1.5 Response to Hard Resets Device operation during and after a Hard Reset is defined as follows: Self-powered devices Should Not disconnect from USB during a Hard Reset (see Section 9.1.2). Bus powered devices will disconnect from USB during a Hard Reset due to the loss of their power source. Tackle this by letting TCPM know whether the device is self or bus powered. This overcomes unnecessary port disconnections from hard reset. Also, speeds up the enumeration time when connected to Type-A ports. Signed-off-by: Badhri Jagan Sridharan <badhri@google.com> Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> --------- Version history: V3: Rebase on top of usb-next V2: Based on feedback from heikki.krogerus@linux.intel.com - self_powered added to the struct tcpm_port which is populated from a. "connector" node of the device tree in tcpm_fw_get_caps() b. "self_powered" node of the tcpc_config in tcpm_copy_caps Based on feedbase from linux@roeck-us.net - Code was refactored - SRC_HARD_RESET_VBUS_OFF sets the link state to false based on self_powered flag V1 located here: https://lkml.org/lkml/2018/9/13/94 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-22block: use rcu_work instead of call_rcu to avoid sleep in softirqYufen Yu
commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream. We recently got a stack by syzkaller like this: BUG: sleeping function called from invalid context at mm/slab.h:361 in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid INFO: lockdep is turned off. CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00 0000000000000000 0000000000000001 0000000000000004 0000000000000000 Call Trace: <IRQ> [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline] <IRQ> [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51 [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675 [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637 [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline] [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline] [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline] [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709 [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline] [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline] [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227 [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374 [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline] [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675 [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline] [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline] [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692 [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237 [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232 [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline] [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline] [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline] [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline] [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957 [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273 [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline] [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391 [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline] [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926 [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746 <EOI> [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180 [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626 [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043 [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline] [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050 [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a In softirq context, we call rcu callback function delete_partition_rcu_cb(), which may allocate memory by kzalloc with GFP_KERNEL flag. If the allocation cannot be satisfied, it may sleep. However, That is not allowed in softirq contex. Although we found this problem on linux 4.4, the latest kernel version seems to have this problem as well. And it is very similar to the previous one: https://lkml.org/lkml/2018/7/9/391 Fix it by using RCU workqueue, which allows sleep. Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22MIPS: BCM47XX: Setup struct device for the SoCRafał Miłecki
commit 321c46b91550adc03054125fa7a1639390608e1a upstream. So far we never had any device registered for the SoC. This resulted in some small issues that we kept ignoring like: 1) Not working GPIOLIB_IRQCHIP (gpiochip_irqchip_add_key() failing) 2) Lack of proper tree in the /sys/devices/ 3) mips_dma_alloc_coherent() silently handling empty coherent_dma_mask Kernel 4.19 came with a lot of DMA changes and caused a regression on bcm47xx. Starting with the commit f8c55dc6e828 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms") DMA coherent allocations just fail. Example: [ 1.114914] bgmac_bcma bcma0:2: Allocation of TX ring 0x200 failed [ 1.121215] bgmac_bcma bcma0:2: Unable to alloc memory for DMA [ 1.127626] bgmac_bcma: probe of bcma0:2 failed with error -12 [ 1.133838] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded The bgmac driver also triggers a WARNING: [ 0.959486] ------------[ cut here ]------------ [ 0.964387] WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 bgmac_enet_probe+0x1b4/0x5c4 [ 0.973751] Modules linked in: [ 0.976913] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.9 #0 [ 0.982750] Stack : 804a0000 804597c4 00000000 00000000 80458fd8 8381bc2c 838282d4 80481a47 [ 0.991367] 8042e3ec 00000001 804d38f0 00000204 83980000 00000065 8381bbe0 6f55b24f [ 0.999975] 00000000 00000000 80520000 00002018 00000000 00000075 00000007 00000000 [ 1.008583] 00000000 80480000 000ee811 00000000 00000000 00000000 80432c00 80248db8 [ 1.017196] 00000009 00000204 83980000 803ad7b0 00000000 801feeec 00000000 804d0000 [ 1.025804] ... [ 1.028325] Call Trace: [ 1.030875] [<8000aef8>] show_stack+0x58/0x100 [ 1.035513] [<8001f8b4>] __warn+0xe4/0x118 [ 1.039708] [<8001f9a4>] warn_slowpath_null+0x48/0x64 [ 1.044935] [<80248db8>] bgmac_enet_probe+0x1b4/0x5c4 [ 1.050101] [<802498e0>] bgmac_probe+0x558/0x590 [ 1.054906] [<80252fd0>] bcma_device_probe+0x38/0x70 [ 1.060017] [<8020e1e8>] really_probe+0x170/0x2e8 [ 1.064891] [<8020e714>] __driver_attach+0xa4/0xec [ 1.069784] [<8020c1e0>] bus_for_each_dev+0x58/0xb0 [ 1.074833] [<8020d590>] bus_add_driver+0xf8/0x218 [ 1.079731] [<8020ef24>] driver_register+0xcc/0x11c [ 1.084804] [<804b54cc>] bgmac_init+0x1c/0x44 [ 1.089258] [<8000121c>] do_one_initcall+0x7c/0x1a0 [ 1.094343] [<804a1d34>] kernel_init_freeable+0x150/0x218 [ 1.099886] [<803a082c>] kernel_init+0x10/0x104 [ 1.104583] [<80005878>] ret_from_kernel_thread+0x14/0x1c [ 1.110107] ---[ end trace f441c0d873d1fb5b ]--- This patch setups a "struct device" (and passes it to the bcma) which allows fixing all the mentioned problems. It'll also require a tiny bcma patch which will follow through the wireless tree & its maintainer. Fixes: f8c55dc6e828 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms") Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: Paul Burton <paul.burton@mips.com> Acked-by: Hauke Mehrtens <hauke@hauke-m.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-wireless@vger.kernel.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-22net: phy: Add missing features to PHY driversAndrew Lunn
[ Upstream commit 9e857a40dc4eba15a739b4194d7db873d82c28a0 ] The bcm87xx and micrel driver has PHYs which are missing the .features value. Add them. The bcm87xx is a 10G FEC only PHY. Add the needed features definition of this PHY. Fixes: 719655a14971 ("net: phy: Replace phy driver features u32 with link_mode bitmap") Reported-by: Scott Wood <oss@buserror.net> Reported-by: Camelia Groza <camelia.groza@nxp.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16sunrpc: use-after-free in svc_process_common()Vasily Averin
commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream. if node have NFSv41+ mounts inside several net namespaces it can lead to use-after-free in svc_process_common() svc_process_common() /* Setup reply header */ rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE svc_process_common() can use incorrect rqstp->rq_xprt, its caller function bc_svc_process() takes it from serv->sv_bc_xprt. The problem is that serv is global structure but sv_bc_xprt is assigned per-netnamespace. According to Trond, the whole "let's set up rqstp->rq_xprt for the back channel" is nothing but a giant hack in order to work around the fact that svc_process_common() uses it to find the xpt_ops, and perform a couple of (meaningless for the back channel) tests of xpt_flags. All we really need in svc_process_common() is to be able to run rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr() Bruce J Fields points that this xpo_prep_reply_hdr() call is an awfully roundabout way just to do "svc_putnl(resv, 0);" in the tcp case. This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(), now it calls svc_process_common() with rqstp->rq_xprt = NULL. To adjust reply header svc_process_common() just check rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case. To handle rqstp->rq_xprt = NULL case in functions called from svc_process_common() patch intruduces net namespace pointer svc_rqst->rq_bc_net and adjust SVC_NET() definition. Some other function was also adopted to properly handle described case. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Cc: stable@vger.kernel.org Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup") Signed-off-by: J. Bruce Fields <bfields@redhat.com> v2: added lost extern svc_tcp_prep_reply_hdr() Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16x86, modpost: Replace last remnants of RETPOLINE with CONFIG_RETPOLINEWANG Chao
commit e4f358916d528d479c3c12bd2fd03f2d5a576380 upstream. Commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the remaining pieces. [ bp: Massage commit message. ] Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") Signed-off-by: WANG Chao <chao.wang@ucloud.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: linux-kbuild@vger.kernel.org Cc: srinivas.eeda@oracle.com Cc: stable <stable@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16cpufreq: scpi/scmi: Fix freeing of dynamic OPPsViresh Kumar
commit 1690d8bb91e370ab772062b79bd434ce815c4729 upstream. Since the commit 2a4eb7358aba "OPP: Don't remove dynamic OPPs from _dev_pm_opp_remove_table()", dynamically created OPP aren't automatically removed anymore by dev_pm_opp_cpumask_remove_table(). This affects the scpi and scmi cpufreq drivers which no longer free OPPs on failures or on invocations of the policy->exit() callback. Create a generic OPP helper dev_pm_opp_remove_all_dynamic() which can be called from these drivers instead of dev_pm_opp_cpumask_remove_table(). In dev_pm_opp_remove_all_dynamic(), we need to make sure that the opp_list isn't getting accessed simultaneously from other parts of the OPP core while the helper is freeing dynamic OPPs, i.e. we can't drop the opp_table->lock while traversing through the OPP list. And to accomplish that, this patch also creates _opp_kref_release_unlocked() which can be called from this new helper with the opp_table lock already held. Cc: 4.20 <stable@vger.kernel.org> # v4.20 Reported-by: Valentin Schneider <valentin.schneider@arm.com> Fixes: 2a4eb7358aba "OPP: Don't remove dynamic OPPs from _dev_pm_opp_remove_table()" Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13mm, hmm: use devm semantics for hmm_devmem_{add, remove}Dan Williams
commit 58ef15b765af0d2cbe6799ec564f1dc485010ab8 upstream. devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13mm, devm_memremap_pages: fix shutdown handlingDan Williams
commit a95c90f1e2c253b280385ecf3d4ebfe476926b28 upstream. The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...") Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com> Reported-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09binder: fix use-after-free due to ksys_close() during fdget()Todd Kjos
commit 80cd795630d6526ba729a089a435bf74a57af927 upstream. 44d8047f1d8 ("binder: use standard functions to allocate fds") exposed a pre-existing issue in the binder driver. fdget() is used in ksys_ioctl() as a performance optimization. One of the rules associated with fdget() is that ksys_close() must not be called between the fdget() and the fdput(). There is a case where this requirement is not met in the binder driver which results in the reference count dropping to 0 when the device is still in use. This can result in use-after-free or other issues. If userpace has passed a file-descriptor for the binder driver using a BINDER_TYPE_FDA object, then kys_close() is called on it when handling a binder_ioctl(BC_FREE_BUFFER) command. This violates the assumptions for using fdget(). The problem is fixed by deferring the close using task_work_add(). A new variant of __close_fd() was created that returns a struct file with a reference. The fput() is deferred instead of using ksys_close(). Fixes: 44d8047f1d87a ("binder: use standard functions to allocate fds") Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Todd Kjos <tkjos@google.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09platform-msi: Free descriptors in platform_msi_domain_free()Miquel Raynal
commit 81b1e6e6a8590a19257e37a1633bec098d499c57 upstream. Since the addition of platform MSI support, there were two helpers supposed to allocate/free IRQs for a device: platform_msi_domain_alloc_irqs() platform_msi_domain_free_irqs() In these helpers, IRQ descriptors are allocated in the "alloc" routine while they are freed in the "free" one. Later, two other helpers have been added to handle IRQ domains on top of MSI domains: platform_msi_domain_alloc() platform_msi_domain_free() Seen from the outside, the logic is pretty close with the former helpers and people used it with the same logic as before: a platform_msi_domain_alloc() call should be balanced with a platform_msi_domain_free() call. While this is probably what was intended to do, the platform_msi_domain_free() does not remove/free the IRQ descriptor(s) created/inserted in platform_msi_domain_alloc(). One effect of such situation is that removing a module that requested an IRQ will let one orphaned IRQ descriptor (with an allocated MSI entry) in the device descriptors list. Next time the module will be inserted back, one will observe that the allocation will happen twice in the MSI domain, one time for the remaining descriptor, one time for the new one. It also has the side effect to quickly overshoot the maximum number of allocated MSI and then prevent any module requesting an interrupt in the same domain to be inserted anymore. This situation has been met with loops of insertion/removal of the mvpp2.ko module (requesting 15 MSIs each time). Fixes: 552c494a7666 ("platform-msi: Allow creation of a MSI-based stacked irq domain") Cc: stable@vger.kernel.org Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-09ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()Cong Wang
[ Upstream commit aff6db454599d62191aabc208930e891748e4322 ] __ptr_ring_swap_queue() tries to move pointers from the old ring to the new one, but it forgets to check if ->producer is beyond the new size at the end of the operation. This leads to an out-of-bound access in __ptr_ring_produce() as reported by syzbot. Reported-by: syzbot+8993c0fa96d57c399735@syzkaller.appspotmail.com Fixes: 5d49de532002 ("ptr_ring: resize support") Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-22Merge tag 'compiler-attributes-for-linus-v4.20' of ↵Linus Torvalds
https://github.com/ojeda/linux Pull compiler_types.h fix from Miguel Ojeda: "A cleanup for userspace in compiler_types.h: don't pollute userspace with macro definitions (Xiaozhou Liu) This is harmless for the kernel, but v4.19 was released with a few macros exposed to userspace as the patch explains; which this removes, so it *could* happen that we break something for someone (although leaving inline redefined is probably worse)" * tag 'compiler-attributes-for-linus-v4.20' of https://github.com/ojeda/linux: include/linux/compiler_types.h: don't pollute userspace with macro definitions
2018-12-22dma-mapping: fix flags in dma_alloc_wcChristoph Hellwig
We really need the writecombine flag in dma_alloc_wc, fix a stupid oversight. Fixes: 7ed1d91a9e ("dma-mapping: translate __GFP_NOFAIL to DMA_ATTR_NO_WARN") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-21Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "The biggest part is a series of reverts for the macro based GCC inlining workarounds. It caused regressions in distro build and other kernel tooling environments, and the GCC project was very receptive to fixing the underlying inliner weaknesses - so as time ran out we decided to do a reasonably straightforward revert of the patches. The plan is to rely on the 'asm inline' GCC 9 feature, which might be backported to GCC 8 and could thus become reasonably widely available on modern distros. Other than those reverts, there's misc fixes from all around the place. I wish our final x86 pull request for v4.20 was smaller..." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "kbuild/Makefile: Prepare for using macros in inline assembly code to work around asm() related GCC inlining bugs" Revert "x86/objtool: Use asm macros to work around GCC inlining bugs" Revert "x86/refcount: Work around GCC inlining bug" Revert "x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs" Revert "x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs" Revert "x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops" Revert "x86/extable: Macrofy inline assembly code to work around GCC inlining bugs" Revert "x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs" Revert "x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs" x86/mtrr: Don't copy uninitialized gentry fields back to userspace x86/fsgsbase/64: Fix the base write helper functions x86/mm/cpa: Fix cpa_flush_array() TLB invalidation x86/vdso: Pass --eh-frame-hdr to the linker x86/mm: Fix decoy address handling vs 32-bit builds x86/intel_rdt: Ensure a CPU remains online for the region's pseudo-locking sequence x86/dump_pagetables: Fix LDT remap address marker x86/mm: Fix guard hole handling
2018-12-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Off by one in netlink parsing of mac802154_hwsim, from Alexander Aring. 2) nf_tables RCU usage fix from Taehee Yoo. 3) Flow dissector needs nhoff and thoff clamping, from Stanislav Fomichev. 4) Missing sin6_flowinfo initialization in SCTP, from Xin Long. 5) Spectrev1 in ipmr and ip6mr, from Gustavo A. R. Silva. 6) Fix r8169 crash when DEBUG_SHIRQ is enabled, from Heiner Kallweit. 7) Fix SKB leak in rtlwifi, from Larry Finger. 8) Fix state pruning in bpf verifier, from Jakub Kicinski. 9) Don't handle completely duplicate fragments as overlapping, from Michal Kubecek. 10) Fix memory corruption with macb and 64-bit DMA, from Anssi Hannula. 11) Fix TCP fallback socket release in smc, from Myungho Jung. 12) gro_cells_destroy needs to napi_disable, from Lorenzo Bianconi. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (130 commits) rds: Fix warning. neighbor: NTF_PROXY is a valid ndm_flag for a dump request net: mvpp2: fix the phylink mode validation net/sched: cls_flower: Remove old entries from rhashtable net/tls: allocate tls context using GFP_ATOMIC iptunnel: make TUNNEL_FLAGS available in uapi gro_cell: add napi_disable in gro_cells_destroy lan743x: Remove MAC Reset from initialization net/mlx5e: Remove the false indication of software timestamping support net/mlx5: Typo fix in del_sw_hw_rule net/mlx5e: RX, Fix wrong early return in receive queue poll ipv6: explicitly initialize udp6_addr in udp_sock_create6() bnxt_en: Fix ethtool self-test loopback. net/rds: remove user triggered WARN_ON in rds_sendmsg net/rds: fix warn in rds_message_alloc_sgs ath10k: skip sending quiet mode cmd for WCN3990 mac80211: free skb fraglist before freeing the skb nl80211: fix memory leak if validate_pae_over_nl80211() fails net/smc: fix TCP fallback socket release vxge: ensure data0 is initialized in when fetching firmware version information ...
2018-12-19Revert "x86/objtool: Use asm macros to work around GCC inlining bugs"Ingo Molnar
This reverts commit c06c4d8090513f2974dfdbed2ac98634357ac475. See this commit for details about the revert: e769742d3584 ("Revert "x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs"") Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Borislav Petkov <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Juergen Gross <jgross@suse.com> Cc: Richard Biener <rguenther@suse.de> Cc: Kees Cook <keescook@chromium.org> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Nadav Amit <namit@vmware.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-18Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Three fixes: The t10-pi one is a regression from the 4.19 release, the qla2xxx one is a 4.20 merge window regression and the bnx2fc is a very old bug" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: t10-pi: Return correct ref tag when queue has no integrity profile scsi: bnx2fc: Fix NULL dereference in error handling Revert "scsi: qla2xxx: Fix NVMe Target discovery"
2018-12-15mod_devicetable.h: correct kerneldoc typo, "PHYSID2" -> "MII_PHYSID2"Robert P. J. Day
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2018-12-15 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix liveness propagation of callee saved registers, from Jakub. 2) fix overflow in bpf_jit_limit knob, from Daniel. 3) bpf_flow_dissector api fix, from Stanislav. 4) bpf_perf_event api fix on powerpc, from Sandipan. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14mm/sparse: add common helper to mark all memblocks presentLogan Gunthorpe
Presently the arches arm64, arm and sh have a function which loops through each memblock and calls memory present. riscv will require a similar function. Introduce a common memblocks_present() function that can be used by all the arches. Subsequent patches will cleanup the arches that make use of this. Link: http://lkml.kernel.org/r/20181107205433.3875-3-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14mm: introduce common STRUCT_PAGE_MAX_SHIFT defineLogan Gunthorpe
This define is used by arm64 to calculate the size of the vmemmap region. It is defined as the log2 of the upper bound on the size of a struct page. We move it into mm_types.h so it can be defined properly instead of set and checked with a build bug. This also allows us to use the same define for riscv. Link: http://lkml.kernel.org/r/20181107205433.3875-2-logang@deltatee.com Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14include/linux/compiler_types.h: don't pollute userspace with macro definitionsXiaozhou Liu
Macros 'inline' and '__gnu_inline' used to be defined in compiler-gcc.h, which was (and is) included entirely in (__KERNEL__ && !__ASSEMBLY__). Commit 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually exclusive") had those macros exposed to userspace, unintentionally. Then commit a3f8a30f3f00 ("Compiler Attributes: use feature checks instead of version checks") moved '__gnu_inline' back into (__KERNEL__ && !__ASSEMBLY__) and 'inline' was left behind. Since 'inline' depends on '__gnu_inline', compiling error showing "unknown type name ‘__gnu_inline’" will pop up, if userspace somehow includes <linux/compiler.h>. Other macros like __must_check, notrace, etc. are in a similar situation. So just move all these macros back into (__KERNEL__ && !__ASSEMBLY__). Note: 1. This patch only affects what userspace sees. 2. __must_check (when !CONFIG_ENABLE_MUST_CHECK) and noinline_for_stack were once defined in __KERNEL__ only, but we believe that they can be put into !__ASSEMBLY__ too. Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Xiaozhou Liu <liuxiaozhou@bytedance.com> Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-12-13Merge tag 'mlx5-fixes-2018-12-13' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux mlx5-fixes-2018-12-13 Subject: [pull request][net 0/9] Mellanox, mlx5 fixes 2018-12-13 Saeed Mahameed says: ==================== This series introduces some fixes to the mlx5 core and mlx5e netdevice driver. ======= Conflict with net-next: When merged with net-next this series will cause a moderate conflict: 1) in drivers/net/ethernet/mellanox/mlx5/core/en_tc.c (2 hunks) Take hunks from net only and just replace *attr->mirror_count to *attr->split_count 1.1) there is one more instance of slow_attr->mirror_count to be replaced with slow_attr->split_count, it doesn't appear in the conflict, it will cause a compilation error if left out. 2) in mlx5_ifc.h, take hunks only from net. Example for the merge resolution can be found at: https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=merge/mlx5-fixes&id=48830adf29804d85d77ed8a251d625db0eb5b8a8 branch merge/mlx5-fixes of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (I simply merged this pull request tag into net-next and resolved the conflict) I don't know if it's ok with you, but to save your time, you can just: git pull git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux merge/mlx5-fixes Into net-next, before your next net merge, and you will have a clean merge of net into net-next (at least for mlx5 files). ====== Please pull and let me know if there's any problem. For -stable v4.18 338d615be484 ('net/mlx5e: Cancel DIM work on close SQ') 91f40f9904ad ('net/mlx5e: RX, Verify MPWQE stride size is in range') For -stable v4.19 c5c7e1c41bbe ('net/mlx5e: Remove unused UDP GSO remaining counter') ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-13Merge tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull XArray fixes from Matthew Wilcox: "Two bugfixes, each with test-suite updates, two improvements to the test-suite without associated bugs, and one patch adding a missing API" * tag 'xarray-4.20-rc7' of git://git.infradead.org/users/willy/linux-dax: XArray: Fix xa_alloc when id exceeds max XArray tests: Check iterating over multiorder entries XArray tests: Handle larger indices more elegantly XArray: Add xa_cmpxchg_irq and xa_cmpxchg_bh radix tree: Don't return retry entries from lookup
2018-12-13net/mlx5: E-Switch, Fix fdb cap bits swapVu Pham
The cap bits locations for the fdb caps of multi path to table (used for local mirroring) and multi encap (used for prio/chains) were wrongly used in swapped locations. This went unnoted so far b/c we tested the offending patch with CX5 FW that supports both of them. On different environments where not both caps are supported, we will be messed up, fix that. Fixes: b9aa0ba17af5 ('net/mlx5: Add cap bits for multi fdb encap') Signed-off-by: Vu Pham <vu@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Tested-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix warnings suspicious rcu usage when handling base chain statistics, from Taehee Yoo. 2) Refetch pointer to tcp header from nf_ct_sack_adjust() since skb_make_writable() may reallocate data area, reported by Google folks patch from Florian. 3) Incorrect netlink nest end after previous cancellation from error path in ipset, from Pan Bian. 4) Use dst_hold_safe() from nf_xfrm_me_harder(), from Florian. 5) Use rb_link_node_rcu() for rcu-protected rbtree node in nf_conncount, from Taehee Yoo. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-11bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64KDaniel Borkmann
Michael and Sandipan report: Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF JIT allocations. At compile time it defaults to PAGE_SIZE * 40000, and is adjusted again at init time if MODULES_VADDR is defined. For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with the compile-time default at boot-time, which is 0x9c400000 when using 64K page size. This overflows the signed 32-bit bpf_jit_limit value: root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit -1673527296 and can cause various unexpected failures throughout the network stack. In one case `strace dhclient eth0` reported: setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524) and similar failures can be seen with tools like tcpdump. This doesn't always reproduce however, and I'm not sure why. The more consistent failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host would time out on systemd/netplan configuring a virtio-net NIC with no noticeable errors in the logs. Given this and also given that in near future some architectures like arm64 will have a custom area for BPF JIT image allocations we should get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For 4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec() so therefore add another overridable bpf_jit_alloc_exec_limit() helper function which returns the possible size of the memory area for deriving the default heuristic in bpf_jit_charge_init(). Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default JIT memory provider, and therefore in case archs implement their custom module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}. Additionally, for archs supporting large page sizes, we should change the sysctl to be handled as long to not run into sysctl restrictions in future. Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations") Reported-by: Sandipan Das <sandipan@linux.ibm.com> Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-12-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "A decent batch of fixes here. I'd say about half are for problems that have existed for a while, and half are for new regressions added in the 4.20 merge window. 1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach. 2) Revert bogus emac driver change, from Benjamin Herrenschmidt. 3) Handle BPF exported data structure with pointers when building 32-bit userland, from Daniel Borkmann. 4) Memory leak fix in act_police, from Davide Caratti. 5) Check RX checksum offload in RX descriptors properly in aquantia driver, from Dmitry Bogdanov. 6) SKB unlink fix in various spots, from Edward Cree. 7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from Eric Dumazet. 8) Fix FID leak in mlxsw driver, from Ido Schimmel. 9) IOTLB locking fix in vhost, from Jean-Philippe Brucker. 10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory limits otherwise namespace exit can hang. From Jiri Wiesner. 11) Address block parsing length fixes in x25 from Martin Schiller. 12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan. 13) For tun interfaces, only iface delete works with rtnl ops, enforce this by disallowing add. From Nicolas Dichtel. 14) Use after free in liquidio, from Pan Bian. 15) Fix SKB use after passing to netif_receive_skb(), from Prashant Bhole. 16) Static key accounting and other fixes in XPS from Sabrina Dubroca. 17) Partially initialized flow key passed to ip6_route_output(), from Shmulik Ladkani. 18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas Falcon. 19) Several small TCP fixes (off-by-one on window probe abort, NULL deref in tail loss probe, SNMP mis-estimations) from Yuchung Cheng" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits) net/sched: cls_flower: Reject duplicated rules also under skip_sw bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips. bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips. bnxt_en: Keep track of reserved IRQs. bnxt_en: Fix CNP CoS queue regression. net/mlx4_core: Correctly set PFC param if global pause is turned off. Revert "net/ibm/emac: wrong bit is used for STA control" neighbour: Avoid writing before skb->head in neigh_hh_output() ipv6: Check available headroom in ip6_xmit() even without options tcp: lack of available data can also cause TSO defer ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl mlxsw: spectrum_router: Relax GRE decap matching check mlxsw: spectrum_switchdev: Avoid leaking FID's reference count mlxsw: spectrum_nve: Remove easily triggerable warnings ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes sctp: frag_point sanity check tcp: fix NULL ref in tail loss probe tcp: Do not underestimate rwnd_limited net: use skb_list_del_init() to remove from RX sublists ...
2018-12-09Merge tag 'char-misc-4.20-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small driver fixes for 4.20-rc6. There is a hyperv fix that for some reaon took forever to get into a shape that could be applied to the tree properly, but resolves a much reported issue. The others are some gnss patches, one a bugfix and the two others updates to the MAINTAINERS file to properly match the gnss files in the tree. All have been in linux-next for a while with no reported issues" * tag 'char-misc-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: MAINTAINERS: exclude gnss from SIRFPRIMA2 regex matching MAINTAINERS: add gnss scm tree gnss: sirf: fix activation retry handling Drivers: hv: vmbus: Offload the handling of channels to two workqueues
2018-12-09Merge tag 'usb-4.20-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are some small USB fixes for 4.20-rc6 The "largest" here are some xhci fixes for reported issues. Also here is a USB core fix, some quirk additions, and a usb-serial fix which required the export of one of the tty layer's functions to prevent code duplication. The tty maintainer agreed with this change. All of these have been in linux-next with no reported issues" * tag 'usb-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: xhci: Prevent U1/U2 link pm states if exit latency is too long xhci: workaround CSS timeout on AMD SNPS 3.0 xHC USB: check usb_get_extra_descriptor for proper size USB: serial: console: fix reported terminal settings usb: quirk: add no-LPM quirk on SanDisk Ultra Flair device USB: Fix invalid-free bug in port_over_current_notify() usb: appledisplay: Add 27" Apple Cinema Display
2018-12-09Merge tag 'dax-fixes-4.20-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fixes from Dan Williams: "The last of the known regression fixes and fallout from the Xarray conversion of the filesystem-dax implementation. On the path to debugging why the dax memory-failure injection test started failing after the Xarray conversion a couple more fixes for the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced. Those plus the bug that started the hunt are now addressed. These patches have appeared in a -next release with no issues reported. Note the touches to mm/memory-failure.c are just the conversion to the new function signature for dax_lock_page(). Summary: - Fix the Xarray conversion of fsdax to properly handle dax_lock_mapping_entry() in the presense of pmd entries - Fix inode destruction racing a new lock request" * tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix unlock mismatch with updated API dax: Don't access a freed inode dax: Check page->mapping isn't NULL
2018-12-08Revert "mm, thp: consolidate THP gfp handling into ↵David Rientjes
alloc_hugepage_direct_gfpmask" This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317. This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations"). The movement of the thp allocation policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was intended to only set __GFP_THISNODE for mempolicies that are not MPOL_BIND whereas the revert could set this regardless of mempolicy. While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask() and alloc_pages_vma() was racy, that has since been removed since the revert. What is left is the possibility to use __GFP_THISNODE in policy_node() when it is unexpected because the special handling for hugepages in alloc_pages_vma() was removed as part of the consolidation. Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat different policy for hugepage allocations, which were allocated through alloc_hugepage_vma(). For hugepage allocations, if the allocating process's node is in the set of allowed nodes, allocate with __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in mm/mempolicy.c which is functionally different behavior and removes the requirement to only allocate hugepages locally. So this commit does a full revert of 89c83fb539f9 instead of the partial revert that was done in 2f0799a0ffc0. The result is the same thp allocation policy for 4.20 that was in 4.19. Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-07scsi: t10-pi: Return correct ref tag when queue has no integrity profileMartin K. Petersen
Commit ddd0bc756983 ("block: move ref_tag calculation func to the block layer") moved ref tag calculation from SCSI to a library function. However, this change broke returning the correct ref tag for devices operating in DIF mode since these do not have an associated block integrity profile. This in turn caused read/write failures on PI-formatted disks attached to an mpt3sas controller. Fixes: ddd0bc756983 ("block: move ref_tag calculation func to the block layer") Cc: stable@vger.kernel.org # 4.19+ Reported-by: John Garry <john.garry@huawei.com> Tested-by: Xiang Chen <chenxiang66@hisilicon.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>