summaryrefslogtreecommitdiff
path: root/include/uapi/linux
AgeCommit message (Collapse)Author
2017-09-06userfaultfd: provide pid in userfault msgAlexey Perevalov
It could be useful for calculating downtime during postcopy live migration per vCPU. Side observer or application itself will be informed about proper task's sleep during userfaultfd processing. Process's thread id is being provided when user requeste it by setting UFFD_FEATURE_THREAD_ID bit into uffdio_api.features. Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: userfaultfd: add feature to request for a signal deliveryPrakash Sangappa
In some cases, userfaultfd mechanism should just deliver a SIGBUS signal to the faulting process, instead of the page-fault event. Dealing with page-fault event using a monitor thread can be an overhead in these cases. For example applications like the database could use the signaling mechanism for robustness purpose. Database uses hugetlbfs for performance reason. Files on hugetlbfs filesystem are created and huge pages allocated using fallocate() API. Pages are deallocated/freed using fallocate() hole punching support. These files are mmapped and accessed by many processes as shared memory. The database keeps track of which offsets in the hugetlbfs file have pages allocated. Any access to mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid and expect the process to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. Using userfaultfd mechanism with this support to get a signal, database application can prevent pages from being allocated implicitly when processes access mapped address over holes in the file. This patch adds UFFD_FEATURE_SIGBUS feature to userfaultfd mechnism to request for a SIGBUS signal. See following for previous discussion about the database requirement leading to this proposal as suggested by Andrea. http://www.spinics.net/lists/linux-mm/msg129224.html Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: shm: use new hugetlb size encoding definitionsMike Kravetz
Use the common definitions from hugetlb_encode.h header file for encoding hugetlb size definitions in shmget system call flags. In addition, move these definitions from the internal (kernel) to user (uapi) header file. Link: http://lkml.kernel.org/r/1501527386-10736-4-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06mm: arch: consolidate mmap hugetlb size encodingsMike Kravetz
A non-default huge page size can be encoded in the flags argument of the mmap system call. The definitions for these encodings are in arch specific header files. However, all architectures use the same values. Consolidate all the definitions in the primary user header file (uapi/linux/mman.h). Include definitions for all known huge page sizes. Use the generic encoding definitions in hugetlb_encode.h as the basis for these definitions. Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) Support ipv6 checksum offload in sunvnet driver, from Shannon Nelson. 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric Dumazet. 3) Allow generic XDP to work on virtual devices, from John Fastabend. 4) Add bpf device maps and XDP_REDIRECT, which can be used to build arbitrary switching frameworks using XDP. From John Fastabend. 5) Remove UFO offloads from the tree, gave us little other than bugs. 6) Remove the IPSEC flow cache, from Florian Westphal. 7) Support ipv6 route offload in mlxsw driver. 8) Support VF representors in bnxt_en, from Sathya Perla. 9) Add support for forward error correction modes to ethtool, from Vidya Sagar Ravipati. 10) Add time filter for packet scheduler action dumping, from Jamal Hadi Salim. 11) Extend the zerocopy sendmsg() used by virtio and tap to regular sockets via MSG_ZEROCOPY. From Willem de Bruijn. 12) Significantly rework value tracking in the BPF verifier, from Edward Cree. 13) Add new jump instructions to eBPF, from Daniel Borkmann. 14) Rework rtnetlink plumbing so that operations can be run without taking the RTNL semaphore. From Florian Westphal. 15) Support XDP in tap driver, from Jason Wang. 16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal. 17) Add Huawei hinic ethernet driver. 18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan Delalande. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits) i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq i40e: avoid NVM acquire deadlock during NVM update drivers: net: xgene: Remove return statement from void function drivers: net: xgene: Configure tx/rx delay for ACPI drivers: net: xgene: Read tx/rx delay for ACPI rocker: fix kcalloc parameter order rds: Fix non-atomic operation on shared flag variable net: sched: don't use GFP_KERNEL under spin lock vhost_net: correctly check tx avail during rx busy polling net: mdio-mux: add mdio_mux parameter to mdio_mux_init() rxrpc: Make service connection lookup always check for retry net: stmmac: Delete dead code for MDIO registration gianfar: Fix Tx flow control deactivation cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6 cxgb4: Fix pause frame count in t4_get_port_stats cxgb4: fix memory leak tun: rename generic_xdp to skb_xdp tun: reserve extra headroom only when XDP is set net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping net: dsa: bcm_sf2: Advertise number of egress queues ...
2017-09-06Merge tag 'dlm-4.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This set includes a bunch of minor code cleanups that have accumulated, probably from code analyzers people like to run. There is one nice fix that avoids some socket leaks by switching to use sock_create_lite()" * tag 'dlm-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: use sock_create_lite inside tcp_accept_from_sock uapi linux/dlm_netlink.h: include linux/dlmconstants.h dlm: avoid double-free on error path in dlm_device_{register,unregister} dlm: constify kset_uevent_ops structure dlm: print log message when cluster name is not set dlm: Delete an unnecessary variable initialisation in dlm_ls_start() dlm: Improve a size determination in two functions dlm: Use kcalloc() in two functions dlm: Use kmalloc_array() in make_member_array() dlm: Delete an error message for a failed memory allocation in dlm_recover_waiters_pre() dlm: Improve a size determination in dlm_recover_waiters_pre() dlm: Use kcalloc() in dlm_scan_waiters() dlm: Improve a size determination in table_seq_start() dlm: Add spaces for better code readability dlm: Replace six seq_puts() calls by seq_putc() dlm: Make dismatch error message more clear dlm: Fix kernel memory disclosure
2017-09-06Merge tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull XFS updates from Darrick Wong: "Here are the changes for xfs for 4.14. Most of these are cleanups and fixes for bad behavior, as we're mostly focusing on improving reliablity this cycle (read: there's potentially a lot of stuff on the horizon for 4.15 so better to spend a few weeks killing other bugs now). Summary: - Write unmount record for a ro mount to avoid unnecessary log replay - Clean up orphaned inodes when mounting fs readonly - Resubmit inode log items when buffer writeback fails to avoid umount hang - Fix log recovery corruption problems when log headers wrap around the end - Avoid infinite loop searching for free inodes when inode counters are wrong - Evict inodes involved with log redo so that we don't leak them later - Fix a potential race between reclaim and inode cluster freeing - Refactor the inode joining code w.r.t. transaction rolling & deferred ops - Fix a bug where the log doesn't properly deal with dirty buffers that are about to become ordered buffers - Fix the extent swap code to deal with making dirty buffers ordered properly - Consolidate page fault handlers - Refactor the incore extent manipulation functions to use the iext abstractions instead of directly modifying with extent data - Disable crashy chattr +/-x until we fix it - Don't allow us to set S_DAX for v2 inodes - Various cleanups - Clarify some documentation - Fix a problem where fsync and a log commit race to send the disk a flush command, resulting in a small window where power fail data loss could occur - Simplify some rmap operations in the fcollapse code - Fix some use-after-free problems in async writeback" * tag 'xfs-4.14-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (44 commits) xfs: use kmem_free to free return value of kmem_zalloc xfs: open code end_buffer_async_write in xfs_finish_page_writeback xfs: don't set v3 xflags for v2 inodes xfs: fix compiler warnings fsmap: fix documentation of FMR_OF_LAST xfs: simplify the rmap code in xfs_bmse_merge xfs: remove unused flags arg from xfs_file_iomap_begin_delay xfs: fix incorrect log_flushed on fsync xfs: disable per-inode DAX flag xfs: replace xfs_qm_get_rtblks with a direct call to xfs_bmap_count_leaves xfs: rewrite xfs_bmap_count_leaves using xfs_iext_get_extent xfs: use xfs_iext_*_extent helpers in xfs_bmap_split_extent_at xfs: use xfs_iext_*_extent helpers in xfs_bmap_shift_extents xfs: move some code around inside xfs_bmap_shift_extents xfs: use xfs_iext_get_extent in xfs_bmap_first_unused xfs: switch xfs_bmap_local_to_extents to use xfs_iext_insert xfs: add a xfs_iext_update_extent helper xfs: consolidate the various page fault handlers iomap: return VM_FAULT_* codes from iomap_page_mkwrite xfs: relog dirty buffers during swapext bmbt owner change ...
2017-09-05Merge tag 'char-misc-4.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big char/misc driver update for 4.14-rc1. Lots of different stuff in here, it's been an active development cycle for some reason. Highlights are: - updated binder driver, this brings binder up to date with what shipped in the Android O release, plus some more changes that happened since then that are in the Android development trees. - coresight updates and fixes - mux driver file renames to be a bit "nicer" - intel_th driver updates - normal set of hyper-v updates and changes - small fpga subsystem and driver updates - lots of const code changes all over the driver trees - extcon driver updates - fmc driver subsystem upadates - w1 subsystem minor reworks and new features and drivers added - spmi driver updates Plus a smattering of other minor driver updates and fixes. All of these have been in linux-next with no reported issues for a while" * tag 'char-misc-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (244 commits) ANDROID: binder: don't queue async transactions to thread. ANDROID: binder: don't enqueue death notifications to thread todo. ANDROID: binder: Don't BUG_ON(!spin_is_locked()). ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctl ANDROID: binder: push new transactions to waiting threads. ANDROID: binder: remove proc waitqueue android: binder: Add page usage in binder stats android: binder: fixup crash introduced by moving buffer hdr drivers: w1: add hwmon temp support for w1_therm drivers: w1: refactor w1_slave_show to make the temp reading functionality separate drivers: w1: add hwmon support structures eeprom: idt_89hpesx: Support both ACPI and OF probing mcb: Fix an error handling path in 'chameleon_parse_cells()' MCB: add support for SC31 to mcb-lpc mux: make device_type const char: virtio: constify attribute_group structures. Documentation/ABI: document the nvmem sysfs files lkdtm: fix spelling mistake: "incremeted" -> "incremented" perf: cs-etm: Fix ETMv4 CONFIGR entry in perf.data file nvmem: include linux/err.h from header ...
2017-09-05Merge tag 'tty-4.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial updates from Greg KH: "Here is the big tty/serial driver update for 4.14-rc1. Well, not all that big, just a number of small serial driver fixes, and a new serial driver. Also in here are some much needed goldfish tty driver (emulator) fixes to try to get that codebase under control. All of these have been in linux-next for a while with no reported issues" * tag 'tty-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (94 commits) tty: goldfish: Implement support for kernel 'earlycon' parameter tty: goldfish: Use streaming DMA for r/w operations on Ranchu platforms tty: goldfish: Refactor constants to better reflect their nature serial: 8250_port: Remove useless NULL checks earlycon: initialise baud field of earlycon device structure tty: hvcs: make ktermios const pty: show associative slave of ptmx in fdinfo tty: n_gsm: Add compat_ioctl tty: hvcs: constify vio_device_id tty: hvc_vio: constify vio_device_id tty: mips_ejtag_fdc: constify mips_cdmm_device_id Introduce 8250_men_mcb mcb: introduce mcb_get_resource() serial: imx: Avoid post-PIO cleanup if TX DMA is started tty: serial: imx: disable irq after suspend serial: 8250_uniphier: add suspend/resume support serial: 8250_uniphier: use CHAR register for canary to detect power-off serial: 8250_uniphier: fix serial port index in private data serial: 8250: of: Add new port type for MediaTek BTIF controller on MT7622/23 SoC dt-bindings: serial: 8250: Add MediaTek BTIF controller bindings ...
2017-09-05Merge tag 'usb-4.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB/PHY driver updates from Greg KH: "Here is the large USB and PHY driver update for 4.14-rc1. Not all that exciting, a few new PHY drivers, the usual mess of gadget driver updates and fixes, and of course, xhci updates to try to tame that beast. A number of usb-serial updates and other small fixes all over the USB driver tree are in here as well. Full details are in the shortlog. All of these have been in linux-next for a while with no reported issues" * tag 'usb-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (171 commits) usbip: vhci-hcd: make vhci_hc_driver const usb: phy: Avoid unchecked dereference warning usb: imx21-hcd: make imx21_hc_driver const usb: host: make ehci_fsl_overrides const and __initconst dt-bindings: mt8173-mtu3: add generic compatible and rename file dt-bindings: mt8173-xhci: add generic compatible and rename file usb: xhci-mtk: add generic compatible string usbip: auto retry for concurrent attach USB: serial: option: simplify 3 D-Link device entries USB: serial: option: add support for D-Link DWM-157 C1 usb: core: usbport: fix "BUG: key not in .data" when lockdep is enabled usb: chipidea: usb2: check memory allocation failure usb: Add device quirk for Logitech HD Pro Webcam C920-C usb: misc: lvstest: add entry to place port in compliance mode usb: xhci: Support enabling of compliance mode for xhci 1.1 usb:xhci:Fix regression when ATI chipsets detected usb: quirks: add delay init quirk for Corsair Strafe RGB keyboard usb: gadget: make snd_pcm_hardware const usb: common: use of_property_read_bool() USB: core: constify vm_operations_struct ...
2017-09-05media: dvb headers: make checkpatch happierMauro Carvalho Chehab
Adjust dvb ca.h, dmx.h and frontend.h in order to make checkpatch happier. Now, it only complains about the typedefs, and those are there just to provide backward userspace compatibility. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: ca.h: document ca_msg and the corresponding ioctlsMauro Carvalho Chehab
Usually, CA messages are sent/received via reading/writing at the CA device node. However, two drivers (dst_ca and firedtv-ci) also implement it via ioctls. Apparently, on both cases, the net result is the same. Anyway, let's document it. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: ca docs: document CA_SET_DESCR ioctl and structsMauro Carvalho Chehab
The av7110 driver uses CA_SET_DESCR to store the descrambler control words at the CA descrambler slots. Document it. Thanks-to: Honza Petrouš <jpetrous@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: net.h: add kernel-doc and use it at Documentation/Mauro Carvalho Chehab
As we did with frontend.h, ca.h and dmx.h, move the struct definition to net.h. That should help to keep it updated, as more stuff gets added there. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: frontend.h: Avoid the term DVB when doesn't refer to a delivery systemMauro Carvalho Chehab
The DVB term can either refer to the subsystem or to a delivery system. Avoid it in the first case at the kernel-doc markups. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: ca.h: document most CA data typesMauro Carvalho Chehab
For most of the stuff there, documenting is easy, as the header file contains information. Yet, I was unable to document two data structs: ca_msg and ca_descr As those two structs are used by a few drivers, keep them. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: ca.h: get rid of CA_SET_PIDMauro Carvalho Chehab
This ioctl seems to be some attempt to support a feature at the bt8xx dst_ca driver. Yet, as said there, it "needs more work". Right now, the code there is just a boilerplate. At the end of the day, no driver uses this ioctl, nor it is documented anywhere (except for "needs more work"). So, get rid of it. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dmx.h: add kernel-doc markups and use it at Documentation/Mauro Carvalho Chehab
The demux documentation is pretty poor nowadays: most of the structs and enums aren't documented at all. Add proper kernel-doc markups for them and use it. Now, the demux API data structures are fully documented :-) Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dmx.h: get rid of DMX_SET_SOURCEMauro Carvalho Chehab
No driver uses this ioctl, nor it is documented anywhere. So, get rid of it. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dmx.h: get rid of DMX_GET_CAPSMauro Carvalho Chehab
There's no driver currently using it; it is also not documented about what it would be supposed to do. So, get rid of it. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dmx.h: get rid of unused DMX_KERNEL_CLIENTMauro Carvalho Chehab
There's a flag defined for Digital TV demux that is not used anywhere, called DMX_KERNEL_CLIENT. Get rid of it. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dvb frontend docs: use kernel-doc documentationMauro Carvalho Chehab
Now that frontend.h contains most documentation for the frontend, remove the duplicated information from Documentation/ and use the kernel-doc auto-generated one instead. That should simplify maintainership of DVB frontend uAPI, as most of the documentation will stick with the header file. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dvb/frontend.h: document the uAPI fileMauro Carvalho Chehab
Most of the stuff at the Digital TV frontend header file are documented only at the Documentation. However, a few kernel-doc markups are there, several of them with parsing issues. Add the missing documentation, copying definitions from the Documentation when it applies, fixing some bugs. Please notice that DVBv3 stuff that were deprecated weren't commented by purpose. Instead, they were clearly tagged as such. This patch prepares to move part of the documentation from Documentation/ to kernel-doc comments. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dvb/frontend.h: move out a private internal structureMauro Carvalho Chehab
struct dtv_cmds_h is just an ancillary struct used by the dvb_frontend.c to internally store frontend commands. It doesn't belong to the userspace header, nor it is used anywhere, except inside the DVB core. So, remove it from the header. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: dmx.h: split typedefs from structsMauro Carvalho Chehab
Using typedefs inside the Kernel is against CodingStyle, and there's no good usage here. Just like we did at frontend.h, at commit 0df289a209e0 ("[media] dvb: Get rid of typedev usage for enums"), let's keep those typedefs only to provide userspace backward compatibility. No functional changes. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-05media: ca.h: split typedefs from structsMauro Carvalho Chehab
Using typedefs inside the Kernel is against CodingStyle, and there's no good usage here. Just like we did at frontend.h, at commit 0df289a209e0 ("[media] dvb: Get rid of typedev usage for enums"), let's keep those typedefs only to provide userspace backward compatibility. No functional changes. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2017-09-04Merge branch 'x86-asm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 asm updates from Ingo Molnar: - Introduce the ORC unwinder, which can be enabled via CONFIG_ORC_UNWINDER=y. The ORC unwinder is a lightweight, Linux kernel specific debuginfo implementation, which aims to be DWARF done right for unwinding. Objtool is used to generate the ORC unwinder tables during build, so the data format is flexible and kernel internal: there's no dependency on debuginfo created by an external toolchain. The ORC unwinder is almost two orders of magnitude faster than the (out of tree) DWARF unwinder - which is important for perf call graph profiling. It is also significantly simpler and is coded defensively: there has not been a single ORC related kernel crash so far, even with early versions. (knock on wood!) But the main advantage is that enabling the ORC unwinder allows CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel measurably: With frame pointers disabled, GCC does not have to add frame pointer instrumentation code to every function in the kernel. The kernel's .text size decreases by about 3.2%, resulting in better cache utilization and fewer instructions executed, resulting in a broad kernel-wide speedup. Average speedup of system calls should be roughly in the 1-3% range - measurements by Mel Gorman [1] have shown a speedup of 5-10% for some function execution intense workloads. The main cost of the unwinder is that the unwinder data has to be stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel config - which is a modest cost on modern x86 systems. Given how young the ORC unwinder code is it's not enabled by default - but given the performance advantages the plan is to eventually make it the default unwinder on x86. See Documentation/x86/orc-unwinder.txt for more details. - Remove lguest support: its intended role was that of a temporary proof of concept for virtualization, plus its removal will enable the reduction (removal) of the paravirt API as well, so Rusty agreed to its removal. (Juergen Gross) - Clean up and fix FSGS related functionality (Andy Lutomirski) - Clean up IO access APIs (Andy Shevchenko) - Enhance the symbol namespace (Jiri Slaby) * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) objtool: Handle GCC stack pointer adjustment bug x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone() x86/fpu/math-emu: Add ENDPROC to functions x86/boot/64: Extract efi_pe_entry() from startup_64() x86/boot/32: Extract efi_pe_entry() from startup_32() x86/lguest: Remove lguest support x86/paravirt/xen: Remove xen_patch() objtool: Fix objtool fallthrough detection with function padding x86/xen/64: Fix the reported SS and CS in SYSCALL objtool: Track DRAP separately from callee-saved registers objtool: Fix validate_branch() return codes x86: Clarify/fix no-op barriers for text_poke_bp() x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs selftests/x86/fsgsbase: Test selectors 1, 2, and 3 x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common x86/asm: Fix UNWIND_HINT_REGS macro for older binutils x86/asm/32: Fix regs_get_register() on segment registers x86/xen/64: Rearrange the SYSCALL entries x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads ...
2017-09-04Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "Kernel side changes: - Add branch type profiling/tracing support. (Jin Yao) - Add the PERF_SAMPLE_PHYS_ADDR ABI to allow the tracing/profiling of physical memory addresses, where the PMU supports it. (Kan Liang) - Export some PMU capability details in the new /sys/bus/event_source/devices/cpu/caps/ sysfs directory. (Andi Kleen) - Aux data fixes and updates (Will Deacon) - kprobes fixes and updates (Masami Hiramatsu) - AMD uncore PMU driver fixes and updates (Janakarajan Natarajan) On the tooling side, here's a (limited!) list of highlights - there were many other changes that I could not list, see the shortlog and git history for details: UI improvements: - Implement a visual marker for fused x86 instructions in the annotate TUI browser, available now in 'perf report', more work needed to have it available as well in 'perf top' (Jin Yao) Further explanation from one of Jin's patches: │ ┌──cmpl $0x0,argp_program_version_hook 81.93 │ ├──je 20 │ │ lock cmpxchg %esi,0x38a9a4(%rip) │ │↓ jne 29 │ │↓ jmp 43 11.47 │20:└─→cmpxch %esi,0x38a999(%rip) That means the cmpl+je is a fused instruction pair and they should be considered together. - Record the branch type and then show statistics and info about in callchain entries (Jin Yao) Example from one of Jin's patches: # perf record -g -j any,save_type # perf report --branch-history --stdio --no-children 38.50% div.c:45 [.] main div | ---main div.c:42 (RET CROSS_2M cycles:2) compute_flag div.c:28 (cycles:2) compute_flag div.c:27 (RET CROSS_2M cycles:1) rand rand.c:28 (cycles:1) rand rand.c:28 (RET CROSS_2M cycles:1) __random random.c:298 (cycles:1) __random random.c:297 (COND_BWD CROSS_2M cycles:1) __random random.c:295 (cycles:1) __random random.c:295 (COND_BWD CROSS_2M cycles:1) __random random.c:295 (cycles:1) __random random.c:295 (RET CROSS_2M cycles:9) namespaces support: - Add initial support for namespaces, using setns to access files in namespaces, grabbing their build-ids, etc. (Krister Johansen) perf trace enhancements: - Beautify pkey_{alloc,free,mprotect} arguments in 'perf trace' (Arnaldo Carvalho de Melo) - Add initial 'clone' syscall args beautifier in 'perf trace' (Arnaldo Carvalho de Melo) - Ignore 'fd' and 'offset' args for MAP_ANONYMOUS in 'perf trace' (Arnaldo Carvalho de Melo) - Beautifiers for the 'cmd' arg of several ioctl types, including: sound, DRM, KVM, vhost virtio and perf_events. (Arnaldo Carvalho de Melo) - Add PERF_SAMPLE_CALLCHAIN and PERF_RECORD_MMAP[2] to 'perf data' CTF conversion, allowing CTF trace visualization tools to show callchains and to resolve symbols (Geneviève Bastien) - Beautify the fcntl syscall, which is an interesting one in the sense that infrastructure had to be put in place to change the formatters of some arguments according to the value in a previous one, i.e. cmd dictates how arg and the syscall return will be formatted. (Arnaldo Carvalho de Melo perf stat enhancements: - Use group read for event groups in 'perf stat', reducing overhead when groups are defined in the event specification, i.e. when using {} to enclose a list of events, asking them to be read at the same time, e.g.: "perf stat -e '{cycles,instructions}'" (Jiri Olsa) pipe mode improvements: - Process tracing data in 'perf annotate' pipe mode (David Carrillo-Cisneros) - Add header record types to pipe-mode, now this command: $ perf record -o - -e cycles sleep 1 | perf report --stdio --header Will show the same as in non-pipe mode, i.e. involving a perf.data file (David Carrillo-Cisneros) Vendor specific hardware event support updates/enhancements: - Update POWER9 vendor events tables (Sukadev Bhattiprolu) - Add POWER9 PMU events Sukadev (Bhattiprolu) - Support additional POWER8+ PVR in PMU mapfile (Shriya) - Add Skylake server uncore JSON vendor events (Andi Kleen) - Support exporting Intel PT data to sqlite3 with python perf scripts, this is in addition to the postgresql support that was already there (Adrian Hunter)" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (253 commits) perf symbols: Fix plt entry calculation for ARM and AARCH64 perf probe: Fix kprobe blacklist checking condition perf/x86: Fix caps/ for !Intel perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR perf/core, pt, bts: Get rid of itrace_started perf trace beauty: Beautify pkey_{alloc,free,mprotect} arguments tools headers: Sync cpu features kernel ABI headers with tooling headers perf tools: Pass full path of FEATURES_DUMP perf tools: Robustify detection of clang binary tools lib: Allow external definition of CC, AR and LD perf tools: Allow external definition of flex and bison binary names tools build tests: Don't hardcode gcc name perf report: Group stat values on global event id perf values: Zero value buffers perf values: Fix allocation check perf values: Fix thread index bug perf report: Add dump_read function perf record: Set read_format for inherit_stat perf c2c: Fix remote HITM detection for Skylake perf tools: Fix static build with newer toolchains ...
2017-09-04netlink: add NLM_F_NONREC flag for deletion requestsPablo Neira Ayuso
In the last NFWS in Faro, Portugal, we discussed that netlink is lacking the semantics to request non recursive deletions, ie. do not delete an object iff it has child objects that hang from this parent object that the user requests to be deleted. We need this new flag to solve a problem for the iptables-compat backward compatibility utility, that runs iptables commands using the existing nf_tables netlink interface. Specifically, custom chains in iptables cannot be deleted if there are rules in it, however, nf_tables allows to remove any chain that is populated with content. To sort out this asymmetry, iptables-compat userspace sets this new NLM_F_NONREC flag to obtain the same semantics that iptables provides. This new flag should only be used for deletion requests. Note this new flag value overlaps with the existing: * NLM_F_ROOT for get requests. * NLM_F_REPLACE for new requests. However, those flags should not ever be used in deletion requests. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnad: "The main RCU related changes in this cycle were: - Removal of spin_unlock_wait() - SRCU updates - RCU torture-test updates - RCU Documentation updates - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant - Miscellaneous RCU fixes - CPU-hotplug fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits) arch: Remove spin_unlock_wait() arch-specific definitions locking: Remove spin_unlock_wait() generic definitions drivers/ata: Replace spin_unlock_wait() with lock/unlock pair ipc: Replace spin_unlock_wait() with lock/unlock pair exit: Replace spin_unlock_wait() with lock/unlock pair completion: Replace spin_unlock_wait() with lock/unlock pair doc: Set down RCU's scheduling-clock-interrupt needs doc: No longer allowed to use rcu_dereference on non-pointers doc: Add RCU files to docbook-generation files doc: Update memory-barriers.txt for read-to-write dependencies doc: Update RCU documentation membarrier: Provide expedited private command rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter() rcu: Add warning to rcu_idle_enter() for irqs enabled rcu: Make rcu_idle_enter() rely on callers disabling irqs rcu: Add assertions verifying blocked-tasks list rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit() rcu: Add TPS() protection for _rcu_barrier_trace strings rcu: Use idle versions of swait to make idle-hack clear swait: Add idle variants which don't contribute to load average ...
2017-09-04netfilter: nft_limit: add stateful object typePablo M. Bermudo Garay
Register a new limit stateful object type into the stateful object infrastructure. Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-04netfilter: xt_hashlimit: add rate match modeVishwanath Pai
This patch adds a new feature to hashlimit that allows matching on the current packet/byte rate without rate limiting. This can be enabled with a new flag --hashlimit-rate-match. The match returns true if the current rate of packets is above/below the user specified value. The main difference between the existing algorithm and the new one is that the existing algorithm rate-limits the flow whereas the new algorithm does not. Instead it *classifies* the flow based on whether it is above or below a certain rate. I will demonstrate this with an example below. Let us assume this rule: iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain If the packet rate is 15/s, the existing algorithm would ACCEPT 10 packets every second and send 5 packets to "new_chain". But with the new algorithm, as long as the rate of 15/s is sustained, all packets will continue to match and every packet is sent to new_chain. This new functionality will let us classify different flows based on their current rate, so that further decisions can be made on them based on what the current rate is. This is how the new algorithm works: We divide time into intervals of 1 (sec/min/hour) as specified by the user. We keep track of the number of packets/bytes processed in the current interval. After each interval we reset the counter to 0. When we receive a packet for match, we look at the packet rate during the current interval and the previous interval to make a decision: if [ prev_rate < user and cur_rate < user ] return Below else return Above Where cur_rate is the number of packets/bytes seen in the current interval, prev is the number of packets/bytes seen in the previous interval and 'user' is the rate specified by the user. We also provide flexibility to the user for choosing the time interval using the option --hashilmit-interval. For example the user can keep a low rate like x/hour but still keep the interval as small as 1 second. To preserve backwards compatibility we have to add this feature in a new revision, so I've created revision 3 for hashlimit. The two new options we add are: --hashlimit-rate-match --hashlimit-rate-interval I have updated the help text to add these new options. Also added a few tests for the new options. Suggested-by: Igor Lubashev <ilubashe@akamai.com> Reviewed-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2017-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree. Basically, updates to the conntrack core, enhancements for nf_tables, conversion of netfilter hooks from linked list to array to improve memory locality and asorted improvements for the Netfilter codebase. More specifically, they are: 1) Add expection to hashes after timer initialization to prevent access from another CPU that walks on the hashes and calls del_timer(), from Florian Westphal. 2) Don't update nf_tables chain counters from hot path, this is only used by the x_tables compatibility layer. 3) Get rid of nested rcu_read_lock() calls from netfilter hook path. Hooks are always guaranteed to run from rcu read side, so remove nested rcu_read_lock() where possible. Patch from Taehee Yoo. 4) nf_tables new ruleset generation notifications include PID and name of the process that has updated the ruleset, from Phil Sutter. 5) Use skb_header_pointer() from nft_fib, so we can reuse this code from the nf_family netdev family. Patch from Pablo M. Bermudo. 6) Add support for nft_fib in nf_tables netdev family, also from Pablo. 7) Use deferrable workqueue for conntrack garbage collection, to reduce power consumption, from Patch from Subash Abhinov Kasiviswanathan. 8) Add nf_ct_expect_iterate_net() helper and use it. From Florian Westphal. 9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian. 10) Drop references on conntrack removal path when skbuffs has escaped via nfqueue, from Florian. 11) Don't queue packets to nfqueue with dying conntrack, from Florian. 12) Constify nf_hook_ops structure, from Florian. 13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter. 14) Add nla_strdup(), from Phil Sutter. 15) Rise nf_tables objects name size up to 255 chars, people want to use DNS names, so increase this according to what RFC 1035 specifies. Patch series from Phil Sutter. 16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook registration on demand, suggested by Eric Dumazet, patch from Florian. 17) Remove unused variables in compat_copy_entry_from_user both in ip_tables and arp_tables code. Patch from Taehee Yoo. 18) Constify struct nf_conntrack_l4proto, from Julia Lawall. 19) Constify nf_loginfo structure, also from Julia. 20) Use a single rb root in connlimit, from Taehee Yoo. 21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo. 22) Use audit_log() instead of open-coding it, from Geliang Tang. 23) Allow to mangle tcp options via nft_exthdr, from Florian. 24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes a fix for a miscalculation of the minimal length. 25) Simplify branch logic in h323 helper, from Nick Desaulniers. 26) Calculate netlink attribute size for conntrack tuple at compile time, from Florian. 27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure. From Florian. 28) Remove holes in nf_conntrack_l4proto structure, so it becomes smaller. From Florian. 29) Get rid of print_tuple() indirection for /proc conntrack listing. Place all the code in net/netfilter/nf_conntrack_standalone.c. Patch from Florian. 30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is off. From Florian. 31) Constify most nf_conntrack_{l3,l4}proto helper functions, from Florian. 32) Fix broken indentation in ebtables extensions, from Colin Ian King. 33) Fix several harmless sparse warning, from Florian. 34) Convert netfilter hook infrastructure to use array for better memory locality, joint work done by Florian and Aaron Conole. Moreover, add some instrumentation to debug this. 35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once per batch, from Florian. 36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian. 37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao. 38) Remove unused code in the generic protocol tracker, from Davide Caratti. I think I will have material for a second Netfilter batch in my queue if time allow to make it fit in this merge window. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-03Merge tag 'drm-for-v4.14' of git://people.freedesktop.org/~airlied/linuxLinus Torvalds
Pull drm updates from Dave Airlie: "This is the main drm pull request for 4.14 merge window. I'm sending this early, as my continuing journey into fatherhood is occurring really soon now, I'm going to be mostly useless for the next couple of weeks, though I may be able to read email, I doubt I'll be doing much patch applications or git sending. If anything urgent pops up I've asked Daniel/Jani/Alex/Sean to try and direct stuff towards you. Outside drm changes: Some rcar-du updates that touch the V4L tree, all acks should be in place. It adds one export to the radix tree code for new i915 use case. There are some minor AGP cleanups (don't see that too often). Changes to the vbox driver in staging to avoid breaking compilation. Summary: core: - Atomic helper fixes - Atomic UAPI fixes - Add YCBCR 4:2:0 support - Drop set_busid hook - Refactor fb_helper locking - Remove a bunch of internal APIs - Add a bunch of better default handlers - Format modifier/blob plane property added - More internal header refactoring - Make more internal API names consistent - Enhanced syncobj APIs (wait/signal/reset/create signalled) bridge: - Add Synopsys Designware MIPI DSI host bridge driver tiny: - Add Pervasive Displays RePaper displays - Add support for LEGO MINDSTORMS EV3 LCD i915: - Lots of GEN10/CNL support patches - drm syncobj support - Skylake+ watermark refactoring - GVT vGPU 48-bit ppgtt support - GVT performance improvements - NOA change ioctl - CCS (color compression) scanout support - GPU reset improvements amdgpu: - Initial hugepage support - BO migration logic rework - Vega10 improvements - Powerplay fixes - Stop reprogramming the MC - Fixes for ACP audio on stoney - SR-IOV fixes/improvements - Command submission overhead improvements amdkfd: - Non-dGPU upstreaming patches - Scratch VA ioctl - Image tiling modes - Update PM4 headers for new firmware - Drop all BUG_ONs. nouveau: - GP108 modesetting support. - Disable MSI on big endian. vmwgfx: - Add fence fd support. msm: - Runtime PM improvements exynos: - NV12MT support - Refactor KMS drivers imx-drm: - Lock scanout channel to improve memory bw - Cleanups etnaviv: - GEM object population fixes tegra: - Prep work for Tegra186 support - PRIME mmap support sunxi: - HDMI support improvements - HDMI CEC support omapdrm: - HDMI hotplug IRQ support - Big driver cleanup - OMAP5 DSI support rcar-du: - vblank fixes - VSP1 updates arcgpu: - Minor fixes stm: - Add STM32 DSI controller driver dw_hdmi: - Add support for Rockchip RK3399 - HDMI CEC support atmel-hlcdc: - Add 8-bit color support vc4: - Atomic fixes - New ioctl to attach a label to a buffer object - HDMI CEC support - Allow userspace to dictate rendering order on submit ioctl" * tag 'drm-for-v4.14' of git://people.freedesktop.org/~airlied/linux: (1074 commits) drm/syncobj: Add a signal ioctl (v3) drm/syncobj: Add a reset ioctl (v3) drm/syncobj: Add a syncobj_array_find helper drm/syncobj: Allow wait for submit and signal behavior (v5) drm/syncobj: Add a CREATE_SIGNALED flag drm/syncobj: Add a callback mechanism for replace_fence (v3) drm/syncobj: add sync obj wait interface. (v8) i915: Use drm_syncobj_fence_get drm/syncobj: Add a race-free drm_syncobj_fence_get helper (v2) drm/syncobj: Rename fence_get to find_fence drm: kirin: Add mode_valid logic to avoid mode clocks we can't generate drm/vmwgfx: Bump the version for fence FD support drm/vmwgfx: Add export fence to file descriptor support drm/vmwgfx: Add support for imported Fence File Descriptor drm/vmwgfx: Prepare to support fence fd drm/vmwgfx: Fix incorrect command header offset at restart drm/vmwgfx: Support the NOP_ERROR command drm/vmwgfx: Restart command buffers after errors drm/vmwgfx: Move irq bottom half processing to threads drm/vmwgfx: Don't use drm_irq_[un]install ...
2017-09-03Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc fixes from Al Viro: "Loose ends and regressions from the last merge window. Strictly speaking, only binfmt_flat thing is a build regression per se - the rest is 'only sparse cares about that' stuff" [ This came in before the 4.13 release and could have gone there, but it was late in the release and nothing seemed critical enough to care, so I'm pulling it in the 4.14 merge window instead - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: binfmt_flat: fix arch/m32r and arch/microblaze flat_put_addr_at_rp() compat_hdio_ioctl: Fix a declaration <linux/uaccess.h>: Fix copy_in_user() declaration annotate RWF_... flags teach SYSCALL_DEFINE/COMPAT_SYSCALL_DEFINE to handle __bitwise arguments
2017-09-01tcp_diag: report TCP MD5 signing keys and addressesIvan Delalande
Report TCP MD5 (RFC2385) signing keys, addresses and address prefixes to processes with CAP_NET_ADMIN requesting INET_DIAG_INFO. Currently it is not possible to retrieve these from the kernel once they have been configured on sockets. Signed-off-by: Ivan Delalande <colona@arista.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Three cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-01fsmap: fix documentation of FMR_OF_LASTDarrick J. Wong
The FMR_OF_LAST flag is set on the last fsmap record being returned for the dataset requested, contrary to what the header file says. Fix the docs to reflect the behavior of all fsmap implementations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2017-09-01Introduce v3 namespaced file capabilitiesSerge E. Hallyn
Root in a non-initial user ns cannot be trusted to write a traditional security.capability xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the xattr, and execute the file with privilege on the host. However supporting file capabilities in a user namespace is very desirable. Not doing so means that any programs designed to run with limited privilege must continue to support other methods of gaining and dropping privilege. For instance a program installer must detect whether file capabilities can be assigned, and assign them if so but set setuid-root otherwise. The program in turn must know how to drop partial capabilities, and do so only if setuid-root. This patch introduces v3 of the security.capability xattr. It builds a vfs_ns_cap_data struct by appending a uid_t rootid to struct vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user namespace which mounted the filesystem, usually init_user_ns) of the root id in whose namespaces the file capabilities may take effect. When a task asks to write a v2 security.capability xattr, if it is privileged with respect to the userns which mounted the filesystem, then nothing should change. Otherwise, the kernel will transparently rewrite the xattr as a v3 with the appropriate rootid. This is done during the execution of setxattr() to catch user-space-initiated capability writes. Subsequently, any task executing the file which has the noted kuid as its root uid, or which is in a descendent user_ns of such a user_ns, will run the file with capabilities. Similarly when asking to read file capabilities, a v3 capability will be presented as v2 if it applies to the caller's namespace. If a task writes a v3 security.capability, then it can provide a uid for the xattr so long as the uid is valid in its own user namespace, and it is privileged with CAP_SETFCAP over its namespace. The kernel will translate that rootid to an absolute uid, and write that to disk. After this, a task in the writer's namespace will not be able to use those capabilities (unless rootid was 0), but a task in a namespace where the given uid is root will. Only a single security.capability xattr may exist at a time for a given file. A task may overwrite an existing xattr so long as it is privileged over the inode. Note this is a departure from previous semantics, which required privilege to remove a security.capability xattr. This check can be re-added if deemed useful. This allows a simple setxattr to work, allows tar/untar to work, and allows us to tar in one namespace and untar in another while preserving the capability, without risking leaking privilege into a parent namespace. Example using tar: $ cp /bin/sleep sleepx $ mkdir b1 b2 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx b2/sleepx = cap_sys_admin+ep # /opt/ltp/testcases/bin/getv3xattr b2/sleepx v3 xattr, rootid is 100001 A patch to linux-test-project adding a new set of tests for this functionality is in the nsfscaps branch at github.com/hallyn/ltp Changelog: Nov 02 2016: fix invalid check at refuse_fcap_overwrite() Nov 07 2016: convert rootid from and to fs user_ns (From ebiederm: mar 28 2017) commoncap.c: fix typos - s/v4/v3 get_vfs_caps_from_disk: clarify the fs_ns root access check nsfscaps: change the code split for cap_inode_setxattr() Apr 09 2017: don't return v3 cap for caps owned by current root. return a v2 cap for a true v2 cap in non-init ns Apr 18 2017: . Change the flow of fscap writing to support s_user_ns writing. . Remove refuse_fcap_overwrite(). The value of the previous xattr doesn't matter. Apr 24 2017: . incorporate Eric's incremental diff . move cap_convert_nscap to setxattr and simplify its usage May 8, 2017: . fix leaking dentry refcount in cap_inode_getsecurity Signed-off-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-09-01ANDROID: binder: Add BINDER_GET_NODE_DEBUG_INFO ioctlColin Cross
The BINDER_GET_NODE_DEBUG_INFO ioctl will return debug info on a node. Each successive call reusing the previous return value will return the next node. The data will be used by libmemunreachable to mark the pointers with kernel references as reachable. Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Martijn Coenen <maco@android.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-01bpf: Add mark and priority to sock options that can be setDavid Ahern
Add socket mark and priority to fields that can be set by ebpf program when a socket is created. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31devlink: Add IPv6 header for dpipeArkadi Sharshevsky
This will be used by the IPv6 host table which will be introduced in the following patches. The fields in the header are added per-use. This header is global and can be reused by many drivers. Signed-off-by: Arkadi Sharshevsky <arkadis@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-31annotate RWF_... flagsChristoph Hellwig
[AV: added missing annotations in syscalls.h/compat.h] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-31loop: add ioctl for changing logical block sizeOmar Sandoval
This is a different approach from the first attempt in f2c6df7dbf9a ("loop: support 4k physical blocksize"). Rather than extending LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block size. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-31PCI/AER: Reformat AER register definitionsBjorn Helgaas
Reformat so comments fit on same line as definition. No functional change intended. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-08-31KVM: PPC: Book3S HV: Report storage key support to userspacePaul Mackerras
This adds information about storage keys to the struct returned by the KVM_PPC_GET_SMMU_INFO ioctl. The new fields replace a pad field, which was zeroed by previous kernel versions. Thus userspace that knows about the new fields will see zeroes when running on an older kernel, indicating that storage keys are not supported. The size of the structure has not changed. The number of keys is hard-coded for the CPUs supported by HV KVM, which is just POWER7, POWER8 and POWER9. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-30net: arp: Add support for raw IP deviceSubash Abhinov Kasiviswanathan
Define the raw IP type. This is needed for raw IP net devices like rmnet. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30net: ether: Add support for multiplexing and aggregation typeSubash Abhinov Kasiviswanathan
Define the Qualcomm multiplexing and aggregation (MAP) ether type 0x00F9. This is needed for receiving data in the MAP protocol like RMNET. This is not an officially registered ID. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-30tcp: Revert "tcp: remove header prediction"Florian Westphal
This reverts commit 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78. Eric Dumazet says: We found at Google a significant regression caused by 45f119bf936b1f9f546a0b139c5b56f9bb2bdc78 tcp: remove header prediction In typical RPC (TCP_RR), when a TCP socket receives data, we now call tcp_ack() while we used to not call it. This touches enough cache lines to cause a slowdown. so problem does not seem to be HP removal itself but the tcp_ack() call. Therefore, it might be possible to remove HP after all, provided one finds a way to elide tcp_ack for most cases. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-29ether: add NSH ethertypeJiri Benc
The NSH draft says: An IEEE EtherType, 0x894F, has been allocated for NSH. Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>