summaryrefslogtreecommitdiffstats
path: root/kernel (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mm, uprobes: convert __replace_page() to use page_vma_mapped_walk()Kirill A. Shutemov2017-02-251-8/+14
| | | | | | | | | | | | | | | | | | | For consistency, it worth converting all page_check_address() to page_vma_mapped_walk(), so we could drop the former. Link: http://lkml.kernel.org/r/20170129173858.45174-10-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* uprobes: split THPs before trying to replace themKirill A. Shutemov2017-02-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Fix few rmap-related THP bugs", v3. The patchset fixes handing PTE-mapped THPs in page_referenced() and page_idle_clear_pte_refs(). To achieve that I've intrdocued new helper -- page_vma_mapped_walk() -- which replaces all page_check_address{,_transhuge}() and covers all THP cases. Patchset overview: - First patch fixes one uprobe bug (unrelated to the rest of the patchset, just spotted it at the same time); - Patches 2-5 fix handling PTE-mapped THPs in page_referenced(), page_idle_clear_pte_refs() and rmap core; - Patches 6-12 convert all page_check_address{,_transhuge}() users (plus remove_migration_pte()) to page_vma_mapped_walk() and drop unused helpers. I think the fixes are not critical enough for stable@ as they don't lead to crashes or hangs, only suboptimal behaviour. This patch (of 12): For THPs page_check_address() always fails. It leads to endless loop in uprobe_write_opcode(). Testcase with huge-tmpfs (uprobes cannot probe anonymous memory). mount -t debugfs none /sys/kernel/debug mount -t tmpfs -o huge=always none /mnt gcc -Wall -O2 -o /mnt/test -x c - <<EOF int main(void) { return 0; } /* Padding to map the code segment with huge pmd */ asm (".zero 2097152"); EOF echo 'p /mnt/test:0' > /sys/kernel/debug/tracing/uprobe_events echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable /mnt/test Let's split THPs before trying to replace. Link: http://lkml.kernel.org/r/20170129173858.45174-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang2017-02-252-5/+5
| | | | | | | | | | | | | | | | | | | | | | | ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin, done}Dan Williams2017-02-251-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | The mem_hotplug_{begin,done} lock coordinates with {get,put}_online_mems() to hold off "readers" of the current state of memory from new hotplug actions. mem_hotplug_begin() expects exclusive access, via the device_hotplug lock, to set mem_hotplug.active_writer. Calling mem_hotplug_begin() without locking device_hotplug can lead to corrupting mem_hotplug.refcount and missed wakeups / soft lockups. [dan.j.williams@intel.com: v2] Link: http://lkml.kernel.org/r/148728203365.38457.17804568297887708345.stgit@dwillia2-desk3.amr.corp.intel.com Link: http://lkml.kernel.org/r/148693885680.16345.17802627926777862337.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: f931ab479dd2 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2017-02-245-11/+76
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
| * prctl: propagate has_child_subreaper flag to every descendantPavel Tikhomirov2017-02-032-3/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group. To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Moreover, I doubt someone intentionaly pre-forks the children whitch should reparent to init before becoming subreaper, because some our ancestor migh have had is_child_subreaper flag while forking our sub-tree and our childs will all inherit has_child_subreaper flag, and we have no way to influence it. And only way to check if we have no has_child_subreaper flag is to create some childs, kill them and see where they will reparent to. Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing seems to be the same. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. * for a) to be true need to move has_child_subreaper inheritance under the same tasklist_lock with adding task to its ->real_parent->children as without it process can inherit zero has_child_subreaper, then we set 1 to it's parent flag, check that parent has no more children, and only after child with wrong flag is added to the tree. * Also make these inheritance more clear by using real_parent instead of current, as on clone(CLONE_PARENT) if current has is_child_subreaper and real_parent has no is_child_subreaper or has_child_subreaper, child will have has_child_subreaper flag set without actually having a subreaper in it's ancestors. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. v2: switch to walk_process_tree() general helper, move has_child_subreaper inheritance v3: remove csr_descendant leftover, change current to real_parent in has_child_subreaper inheritance v4: small commit message fix Fixes: ebec18a6d3aa ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * introduce the walk_process_tree() helperOleg Nesterov2017-02-031-0/+32
| | | | | | | | | | | | | | | | | | | | | | Add the new helper to walk the process tree, the next patch adds a user. Note that it visits the group leaders only, proc_visitor can do for_each_thread itself or we can trivially extend walk_process_tree() to do this. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * fs: Better permission checking for submountsEric W. Biederman2017-02-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To support unprivileged users mounting filesystems two permission checks have to be performed: a test to see if the user allowed to create a mount in the mount namespace, and a test to see if the user is allowed to access the specified filesystem. The automount case is special in that mounting the original filesystem grants permission to mount the sub-filesystems, to any user who happens to stumble across the their mountpoint and satisfies the ordinary filesystem permission checks. Attempting to handle the automount case by using override_creds almost works. It preserves the idea that permission to mount the original filesystem is permission to mount the sub-filesystem. Unfortunately using override_creds messes up the filesystems ordinary permission checks. Solve this by being explicit that a mount is a submount by introducing vfs_submount, and using it where appropriate. vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let sget and friends know that a mount is a submount so they can take appropriate action. sget and sget_userns are modified to not perform any permission checks on submounts. follow_automount is modified to stop using override_creds as that has proven problemantic. do_mount is modified to always remove the new MS_SUBMOUNT flag so that we know userspace will never by able to specify it. autofs4 is modified to stop using current_real_cred that was put in there to handle the previous version of submount permission checking. cifs is modified to pass the mountpoint all of the way down to vfs_submount. debugfs is modified to pass the mountpoint all of the way down to trace_automount by adding a new parameter. To make this change easier a new typedef debugfs_automount_t is introduced to capture the type of the debugfs automount function. Cc: stable@vger.kernel.org Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid") Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds") Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * exit: fix the setns() && PR_SET_CHILD_SUBREAPER interactionOleg Nesterov2017-02-011-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | find_new_reaper() checks same_thread_group(reaper, child_reaper) to prevent the cross-namespace reparenting but this is not enough if the exiting parent was injected by setns() + fork(). Suppose we have a process P in the root namespace and some namespace X. P does setns() to enter the X namespace, and forks the child C. C forks a grandchild G and exits. The grandchild G should be re-parented to X->child_reaper, but in this case the ->real_parent chain does not lead to ->child_reaper, so it will be wrongly reparanted to P's sub-reaper or a global init. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * inotify: Convert to using per-namespace limitsNikolay Borisov2017-01-241-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patchset converts inotify to using the newly introduced per-userns sysctl infrastructure. Currently the inotify instances/watches are being accounted in the user_struct structure. This means that in setups where multiple users in unprivileged containers map to the same underlying real user (i.e. pointing to the same user_struct) the inotify limits are going to be shared as well, allowing one user(or application) to exhaust all others limits. Fix this by switching the inotify sysctls to using the per-namespace/per-user limits. This will allow the server admin to set sensible global limits, which can further be tuned inside every individual user namespace. Additionally, in order to preserve the sysctl ABI make the existing inotify instances/watches sysctls modify the values of the initial user namespace. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2017-02-231-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Some 'const'ing in qlogic networking drivers, from Bhumika Goyal. 2) Fix scheduling while atomic in l2tp network namespace exit by deferring the work to the workqueue. From Ridge Kennedy. 3) Fix use after free in dccp timewait handling, from Andrey Ryabinin. 4) mlx5e CQE compression engine not initialized properly, from Tariq Toukan. 5) Some UAPI header fixes from Dmitry V. Levin. 6) Don't overwrite module parameter value in mlx4 driver, from Majd Dibbiny. 7) Fix divide by zero in xt_hashlimit netfilter module, from Alban Browaeys. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) bpf: Fix bpf_xdp_event_output net/mlx4_en: Use __skb_fill_page_desc() net/mlx4_core: Use cq quota in SRIOV when creating completion EQs net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs net/mlx4: Spoofcheck and zero MAC can't coexist net/mlx4: Change ENOTSUPP to EOPNOTSUPP uapi: fix linux/rds.h userspace compilation errors uapi: fix linux/seg6.h and linux/seg6_iptunnel.h userspace compilation errors lib: Remove string from parman config selection forcedeth: Remove return from a void function bpf: fix spelling mistake: "proccessed" -> "processed" uapi: fix linux/llc.h userspace compilation error uapi: fix linux/ip6_tunnel.h userspace compilation errors net/mlx5e: Fix wrong CQE decompression net/mlx5e: Update MPWQE stride size when modifying CQE compress state net/mlx5e: Fix broken CQE compression initialization net/mlx5e: Do not reduce LRO WQE size when not using build_skb net/mlx5e: Register/unregister vport representors on interface attach/detach net/mlx5e: s390 system compilation fix tcp: account for ts offset only if tsecr not zero ...
| * | bpf: fix spelling mistake: "proccessed" -> "processed"Colin Ian King2017-02-231-1/+1
| | | | | | | | | | | | | | | | | | | | | trivial fix to spelling mistake in verbose log message Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'akpm' (patches from Andrew)Linus Torvalds2017-02-233-13/+60
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge updates from Andrew Morton: "142 patches: - DAX updates - various misc bits - OCFS2 updates - most of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits) mm/z3fold.c: limit first_num to the actual range of possible buddy indexes mm: fix <linux/pagemap.h> stray kernel-doc notation zram: remove obsolete sysfs attrs mm/memblock.c: remove unnecessary log and clean up oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA mm: drop unused argument of zap_page_range() mm: drop zap_details::check_swap_entries mm: drop zap_details::ignore_dirty mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled mm: help __GFP_NOFAIL allocations which do not trigger OOM killer mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically mm: consolidate GFP_NOFAIL checks in the allocator slowpath lib/show_mem.c: teach show_mem to work with the given nodemask arch, mm: remove arch specific show_mem mm, page_alloc: warn_alloc print nodemask mm, page_alloc: do not report all nodes in show_mem Revert "mm: bail out in shrink_inactive_list()" mm, vmscan: consider eligible zones in get_scan_count mm, vmscan: cleanup lru size claculations mm, vmscan: do not count freed pages as PGDEACTIVATE ...
| * | | userfaultfd: non-cooperative: Add fork() eventPavel Emelyanov2017-02-231-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. [aarcange@redhat.com: build warning fix] Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | kernel/watchdog.c: do not hardcode CPU 0 as the initial threadPrarit Bhargava2017-02-231-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_BOOTPARAM_HOTPLUG_CPU0 is enabled, the socket containing the boot cpu can be replaced. During the hot add event, the message NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. is output implying that the NMI watchdog was disabled at some point. This is not the case and the message has caused confusion for users of systems that support the removal of the boot cpu socket. The watchdog code is coded to assume that cpu 0 is always the first cpu to initialize the watchdog, and the last to stop its watchdog thread. That is not the case for initializing if cpu 0 has been removed and added. The removal case has never been correct because the smpboot code will remove the watchdog threads starting with the lowest cpu number. This patch adds watchdog_cpus to track the number of cpus with active NMI watchdog threads so that the first and last thread can be used to set and clear the value of firstcpu_err. firstcpu_err is set when the first watchdog thread is enabled, and cleared when the last watchdog thread is disabled. Link: http://lkml.kernel.org/r/1480425321-32296-1-git-send-email-prarit@redhat.com Signed-off-by: Prarit Bhargava <prarit@redhat.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <ak@linux.intel.com> Cc: Joshua Hunt <johunt@akamai.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Babu Moger <babu.moger@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | tracing: add __print_flags_u64()Ross Zwisler2017-02-231-0/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "DAX tracepoints, mm argument simplification", v4. This contains both my DAX tracepoint code and Dave Jiang's MM argument simplifications. Dave's code was written with my tracepoint code as a baseline, so it seemed simplest to keep them together in a single series. This patch (of 7): Add __print_flags_u64() and the helper trace_print_flags_seq_u64() in the same spirit as __print_symbolic_u64() and trace_print_symbols_seq_u64(). These functions allow us to print symbols associated with flags that are 64 bits wide even on 32 bit machines. These will be used by the DAX code so that we can print the flags set in a pfn_t such as PFN_SG_CHAIN, PFN_SG_LAST, PFN_DEV and PFN_MAP. Without this new function I was getting errors like the following when compiling for i386: include/linux/pfn_t.h:13:22: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1)) ^ Link: http://lkml.kernel.org/r/1484085142-2297-2-git-send-email-ross.zwisler@linux.intel.com Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2017-02-231-5/+8
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull seccomp fix from James Morris: "A fix for a regression in the seccomp code (it was supposed to be in the first pull req but I had it queued in the wrong branch)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: seccomp: Only dump core when single-threaded
| * | | | seccomp: Only dump core when single-threadedKees Cook2017-02-221-5/+8
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SECCOMP_RET_KILL filter return code has always killed the current thread, not the entire process. Changing this as a side-effect of dumping core isn't a safe thing to do (a few test suites have already flagged this behavioral change). Instead, restore the RET_KILL semantics, but still dump core when a RET_KILL delivers SIGSYS to a single-threaded process. Fixes: b25e67161c29 ("seccomp: dump core when using SECCOMP_RET_KILL") Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2017-02-236-240/+313
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven Rostedt as the printk reviewer. This idea came up after the discussion about printk issues at Kernel Summit. It was formulated and discussed at lkml[1]. - Extend a lock-less NMI per-cpu buffers idea to handle recursive printk() calls by Sergey Senozhatsky[2]. It is the first step in sanitizing printk as discussed at Kernel Summit. The change allows to see messages that would normally get ignored or would cause a deadlock. Also it allows to enable lockdep in printk(). This already paid off. The testing in linux-next helped to discover two old problems that were hidden before[3][4]. - Remove unused parameter by Sergey Senozhatsky. Clean up after a past change. [1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com [2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com [3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com [4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: printk: drop call_console_drivers() unused param printk: convert the rest to printk-safe printk: remove zap_locks() function printk: use printk_safe buffers in printk printk: report lost messages in printk safe/nmi contexts printk: always use deferred printk when flush printk_safe lines printk: introduce per-cpu safe_print seq buffer printk: rename nmi.c and exported api printk: use vprintk_func in vprintk() MAINTAINERS: Add printk maintainers
| * | | | printk: drop call_console_drivers() unused paramSergey Senozhatsky2017-02-081-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We do suppress_message_printing() check before we call call_console_drivers() now, so `level' param is not needed anymore. Link: http://lkml.kernel.org/r/20161224140902.1962-2-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
| * | | | printk: convert the rest to printk-safeSergey Senozhatsky2017-02-081-38/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts the rest of logbuf users (which are out of printk recursion case, but can deadlock in printk). To make printk-safe usage easier the patch introduces 4 helper macros: - logbuf_lock_irq()/logbuf_unlock_irq() lock/unlock the logbuf lock and disable/enable local IRQ - logbuf_lock_irqsave(flags)/logbuf_unlock_irqrestore(flags) lock/unlock the logbuf lock and saves/restores local IRQ state Link: http://lkml.kernel.org/r/20161227141611.940-9-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
| * | | | printk: remove zap_locks() functionSergey Senozhatsky2017-02-081-61/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use printk-safe now which makes printk-recursion detection code in vprintk_emit() unreachable. The tricky thing here is that, apart from detecting and reporting printk recursions, that code also used to zap_locks() in case of panic() from the same CPU. However, zap_locks() does not look to be needed anymore: 1) Since commit 08d78658f393 ("panic: release stale console lock to always get the logbuf printed out") panic flushing of `logbuf' to console ignores the state of `console_sem' by doing panic() console_trylock(); console_unlock(); 2) Since commit cf9b1106c81c ("printk/nmi: flush NMI messages on the system panic") panic attempts to zap the `logbuf_lock' spin_lock to successfully flush nmi messages to `logbuf'. Basically, it seems that we either already do what zap_locks() used to do but in other places or we ignore the state of the lock. The only reaming difference is that we don't re-init the console semaphore in printk_safe_flush_on_panic(), but this is not necessary because we don't call console drivers from printk_safe_flush_on_panic() due to the fact that we are using a deferred printk() version (as was suggested by Petr Mladek). Link: http://lkml.kernel.org/r/20161227141611.940-8-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
| * | | | printk: use printk_safe buffers in printkSergey Senozhatsky2017-02-081-15/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use printk_safe per-CPU buffers in printk recursion-prone blocks: -- around logbuf_lock protected sections in vprintk_emit() and console_unlock() -- around down_trylock_console_sem() and up_console_sem() Note that this solution addresses deadlocks caused by printk() recursive calls only. That is vprintk_emit() and console_unlock(). The rest will be converted in a followup patch. Another thing to note is that we now keep lockdep enabled in printk, because we are protected against the printk recursion caused by lockdep in vprintk_emit() by the printk-safe mechanism - we first switch to per-CPU buffers and only then access the deadlock-prone locks. Examples: 1) printk() from logbuf_lock spin_lock section Assume the following code: printk() raw_spin_lock(&logbuf_lock); WARN_ON(1); raw_spin_unlock(&logbuf_lock); which now produces: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 366 at kernel/printk/printk.c:1811 vprintk_emit CPU: 0 PID: 366 Comm: bash Call Trace: warn_slowpath_null+0x1d/0x1f vprintk_emit+0x1cd/0x438 vprintk_default+0x1d/0x1f printk+0x48/0x50 [..] 2) printk() from semaphore sem->lock spin_lock section Assume the following code printk() console_trylock() down_trylock() raw_spin_lock_irqsave(&sem->lock, flags); WARN_ON(1); raw_spin_unlock_irqrestore(&sem->lock, flags); which now produces: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 363 at kernel/locking/semaphore.c:141 down_trylock CPU: 1 PID: 363 Comm: bash Call Trace: warn_slowpath_null+0x1d/0x1f down_trylock+0x3d/0x62 ? vprintk_emit+0x3f9/0x414 console_trylock+0x31/0xeb vprintk_emit+0x3f9/0x414 vprintk_default+0x1d/0x1f printk+0x48/0x50 [..] 3) printk() from console_unlock() Assume the following code: printk() console_unlock() raw_spin_lock(&logbuf_lock); WARN_ON(1); raw_spin_unlock(&logbuf_lock); which now produces: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 329 at kernel/printk/printk.c:2384 console_unlock CPU: 1 PID: 329 Comm: bash Call Trace: warn_slowpath_null+0x18/0x1a console_unlock+0x12d/0x559 ? trace_hardirqs_on_caller+0x16d/0x189 ? trace_hardirqs_on+0xd/0xf vprintk_emit+0x363/0x374 vprintk_default+0x18/0x1a printk+0x43/0x4b [..] 4) printk() from try_to_wake_up() Assume the following code: printk() console_unlock() up() try_to_wake_up() raw_spin_lock_irqsave(&p->pi_lock, flags); WARN_ON(1); raw_spin_unlock_irqrestore(&p->pi_lock, flags); which now produces: ------------[ cut here ]------------ WARNING: CPU: 3 PID: 363 at kernel/sched/core.c:2028 try_to_wake_up CPU: 3 PID: 363 Comm: bash Call Trace: warn_slowpath_null+0x1d/0x1f try_to_wake_up+0x7f/0x4f7 wake_up_process+0x15/0x17 __up.isra.0+0x56/0x63 up+0x32/0x42 __up_console_sem+0x37/0x55 console_unlock+0x21e/0x4c2 vprintk_emit+0x41c/0x462 vprintk_default+0x1d/0x1f printk+0x48/0x50 [..] 5) printk() from call_console_drivers() Assume the following code: printk() console_unlock() call_console_drivers() ... WARN_ON(1); which now produces: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 305 at kernel/printk/printk.c:1604 call_console_drivers CPU: 2 PID: 305 Comm: bash Call Trace: warn_slowpath_null+0x18/0x1a call_console_drivers.isra.6.constprop.16+0x3a/0xb0 console_unlock+0x471/0x48e vprintk_emit+0x1f4/0x206 vprintk_default+0x18/0x1a vprintk_func+0x6e/0x70 printk+0x3e/0x46 [..] 6) unsupported placeholder in printk() format now prints an actual warning from vscnprintf(), instead of 'BUG: recent printk recursion!'. ------------[ cut here ]------------ WARNING: CPU: 5 PID: 337 at lib/vsprintf.c:1900 format_decode Please remove unsupported % in format string CPU: 5 PID: 337 Comm: bash Call Trace: dump_stack+0x4f/0x65 __warn+0xc2/0xdd warn_slowpath_fmt+0x4b/0x53 format_decode+0x22c/0x308 vsnprintf+0x89/0x3b7 vscnprintf+0xd/0x26 vprintk_emit+0xb4/0x238 vprintk_default+0x1d/0x1f vprintk_func+0x6c/0x73 printk+0x43/0x4b [..] Link: http://lkml.kernel.org/r/20161227141611.940-7-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | | printk: report lost messages in printk safe/nmi contextsSergey Senozhatsky2017-02-083-37/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Account lost messages in pritk-safe and printk-safe-nmi contexts and report those numbers during printk_safe_flush(). The patch also moves lost message counter to struct `printk_safe_seq_buf' instead of having dedicated static counters - this simplifies the code. Link: http://lkml.kernel.org/r/20161227141611.940-6-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
| * | | | printk: always use deferred printk when flush printk_safe linesSergey Senozhatsky2017-02-081-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Always use printk_deferred() in printk_safe_flush_line(). Flushing can be done from NMI or printk_safe contexts (when we are in panic), so we can't call console drivers, yet still want to store the messages in the logbuf buffer. Therefore we use a deferred printk version. Link: http://lkml.kernel.org/r/20170206164253.GA463@tigerII.localdomain Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | | printk: introduce per-cpu safe_print seq bufferSergey Senozhatsky2017-02-084-52/+157
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch extends the idea of NMI per-cpu buffers to regions that may cause recursive printk() calls and possible deadlocks. Namely, printk() can't handle printk calls from schedule code or printk() calls from lock debugging code (spin_dump() for instance); because those may be called with `sem->lock' already taken or any other `critical' locks (p->pi_lock, etc.). An example of deadlock can be vprintk_emit() console_unlock() up() << raw_spin_lock_irqsave(&sem->lock, flags); wake_up_process() try_to_wake_up() ttwu_queue() ttwu_activate() activate_task() enqueue_task() enqueue_task_fair() cfs_rq_of() task_of() WARN_ON_ONCE(!entity_is_task(se)) vprintk_emit() console_trylock() down_trylock() raw_spin_lock_irqsave(&sem->lock, flags) ^^^^ deadlock and some other cases. Just like in NMI implementation, the solution uses a per-cpu `printk_func' pointer to 'redirect' printk() calls to a 'safe' callback, that store messages in a per-cpu buffer and flushes them back to logbuf buffer later. Usage example: printk() printk_safe_enter_irqsave(flags) // // any printk() call from here will endup in vprintk_safe(), // that stores messages in a special per-CPU buffer. // printk_safe_exit_irqrestore(flags) The 'redirection' mechanism, though, has been reworked, as suggested by Petr Mladek. Instead of using a per-cpu @print_func callback we now keep a per-cpu printk-context variable and call either default or nmi vprintk function depending on its value. printk_nmi_entrer/exit and printk_safe_enter/exit, thus, just set/celar corresponding bits in printk-context functions. The patch only adds printk_safe support, we don't use it yet. Link: http://lkml.kernel.org/r/20161227141611.940-4-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
| * | | | printk: rename nmi.c and exported apiSergey Senozhatsky2017-02-084-36/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A preparation patch for printk_safe work. No functional change. - rename nmi.c to print_safe.c - add `printk_safe' prefix to some (which used both by printk-safe and printk-nmi) of the exported functions. Link: http://lkml.kernel.org/r/20161227141611.940-3-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
| * | | | printk: use vprintk_func in vprintk()Sergey Senozhatsky2017-02-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vprintk(), just like printk(), better be using per-cpu printk_func instead of direct vprintk_emit() call. Just in case if vprintk() will ever be called from NMI, or from any other context that can deadlock in printk(). Link: http://lkml.kernel.org/r/20161227141611.940-2-sergey.senozhatsky@gmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Calvin Owens <calvinowens@fb.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Petr Mladek <pmladek@suse.com>
* | | | | Merge tag 'modules-for-v4.11' of ↵Linus Torvalds2017-02-231-13/+17
|\ \ \ \ \ | |_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.11 merge window: - A few small code cleanups - Add modules git tree url to MAINTAINERS" * tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: MAINTAINERS: add tree for modules module: fix memory leak on early load_module() failures module: Optimize search_module_extables() modules: mark __inittest/__exittest as __maybe_unused livepatch/module: print notice of TAINT_LIVEPATCH module: Drop redundant declaration of struct module
| * | | | module: fix memory leak on early load_module() failuresLuis R. Rodriguez2017-02-211-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking for early possible module loading failures I was able to reproduce a memory leak possible with kmemleak. There are a few rare ways to trigger a failure: o we've run into a failure while processing kernel parameters (parse_args() returns an error) o mod_sysfs_setup() fails o we're a live patch module and copy_module_elf() fails Chances of running into this issue is really low. kmemleak splat: unreferenced object 0xffff9f2c4ada1b00 (size 32): comm "kworker/u16:4", pid 82, jiffies 4294897636 (age 681.816s) hex dump (first 32 bytes): 6d 65 6d 73 74 69 63 6b 30 00 00 00 00 00 00 00 memstick0....... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8c6cfeba>] kmemleak_alloc+0x4a/0xa0 [<ffffffff8c200046>] __kmalloc_track_caller+0x126/0x230 [<ffffffff8c1bc581>] kstrdup+0x31/0x60 [<ffffffff8c1bc5d4>] kstrdup_const+0x24/0x30 [<ffffffff8c3c23aa>] kvasprintf_const+0x7a/0x90 [<ffffffff8c3b5481>] kobject_set_name_vargs+0x21/0x90 [<ffffffff8c4fbdd7>] dev_set_name+0x47/0x50 [<ffffffffc07819e5>] memstick_check+0x95/0x33c [memstick] [<ffffffff8c09c893>] process_one_work+0x1f3/0x4b0 [<ffffffff8c09cb98>] worker_thread+0x48/0x4e0 [<ffffffff8c0a2b79>] kthread+0xc9/0xe0 [<ffffffff8c6dab5f>] ret_from_fork+0x1f/0x40 [<ffffffffffffffff>] 0xffffffffffffffff Cc: stable <stable@vger.kernel.org> # v2.6.30 Fixes: e180a6b7759a ("param: fix charp parameters set via sysfs") Reviewed-by: Miroslav Benes <mbenes@suse.cz> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Signed-off-by: Jessica Yu <jeyu@redhat.com>
| * | | | module: Optimize search_module_extables()Peter Zijlstra2017-02-111-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While looking through the __ex_table stuff I found that we do a linear lookup of the module. Also fix up a comment. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Jessica Yu <jeyu@redhat.com>
| * | | | livepatch/module: print notice of TAINT_LIVEPATCHJoe Lawrence2017-01-311-0/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add back the "tainting kernel with TAINT_LIVEPATCH" kernel log message that commit 2992ef29ae01 ("livepatch/module: make TAINT_LIVEPATCH module-specific") dropped. Now that it's a module-specific taint flag, include the module name. Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Signed-off-by: Jessica Yu <jeyu@redhat.com>
* | | | Merge tag 'driver-core-4.11-rc1' of ↵Linus Torvalds2017-02-221-2/+16
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the "small" driver core patches for 4.11-rc1. Not much here, some firmware documentation and self-test updates, a debugfs code formatting issue, and a new feature for call_usermodehelper to make it more robust on systems that want to lock it down in a more secure way. All of these have been linux-next for a while now with no reported issues" * tag 'driver-core-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: kernfs: handle null pointers while printing node name and path Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper() Make static usermode helper binaries constant kmod: make usermodehelper path a const string firmware: revamp firmware documentation selftests: firmware: send expected errors to /dev/null selftests: firmware: only modprobe if driver is missing platform: Print the resource range if device failed to claim kref: prefer atomic_inc_not_zero to atomic_add_unless debugfs: improve formatting of debugfs_real_fops()
| * | | Introduce STATIC_USERMODEHELPER to mediate call_usermodehelper()Greg Kroah-Hartman2017-01-191-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some usermode helper applications are defined at kernel build time, while others can be changed at runtime. To provide a sane way to filter these, add a new kernel option "STATIC_USERMODEHELPER". This option routes all call_usermodehelper() calls through this binary, no matter what the caller wishes to have called. The new binary (by default set to /sbin/usermode-helper, but can be changed through the STATIC_USERMODEHELPER_PATH option) can properly filter the requested programs to be run by the kernel by looking at the first argument that is passed to it. All other options should then be passed onto the proper program if so desired. To disable all call_usermodehelper() calls by the kernel, set STATIC_USERMODEHELPER_PATH to an empty string. Thanks to Neil Brown for the idea of this feature. Cc: NeilBrown <neilb@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | kmod: make usermodehelper path a const stringGreg Kroah-Hartman2017-01-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is in preparation for making it so that usermode helper programs can't be changed, if desired, by userspace. We will tackle the mess of cleaning up the write-ability of argv and env later, that's going to take more work, for much less gain... Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | Merge tag 'arm64-upstream' of ↵Linus Torvalds2017-02-221-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: - Errata workarounds for Qualcomm's Falkor CPU - Qualcomm L2 Cache PMU driver - Qualcomm SMCCC firmware quirk - Support for DEBUG_VIRTUAL - CPU feature detection for userspace via MRS emulation - Preliminary work for the Statistical Profiling Extension - Misc cleanups and non-critical fixes * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (74 commits) arm64/kprobes: consistently handle MRS/MSR with XZR arm64: cpufeature: correctly handle MRS to XZR arm64: traps: correctly handle MRS/MSR with XZR arm64: ptrace: add XZR-safe regs accessors arm64: include asm/assembler.h in entry-ftrace.S arm64: fix warning about swapper_pg_dir overflow arm64: Work around Falkor erratum 1003 arm64: head.S: Enable EL1 (host) access to SPE when entered at EL2 arm64: arch_timer: document Hisilicon erratum 161010101 arm64: use is_vmalloc_addr arm64: use linux/sizes.h for constants arm64: uaccess: consistently check object sizes perf: add qcom l2 cache perf events driver arm64: remove wrong CONFIG_PROC_SYSCTL ifdef ARM: smccc: Update HVC comment to describe new quirk parameter arm64: do not trace atomic operations ACPI/IORT: Fix the error return code in iort_add_smmu_platform_device() ACPI/IORT: Fix iort_node_get_id() mapping entries indexing arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA perf: xgene: Include module.h ...
| * \ \ \ Merge branch 'aarch64/for-next/debug-virtual' into aarch64/for-next/coreWill Deacon2017-01-121-1/+1
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge core DEBUG_VIRTUAL changes from Laura Abbott. Later arm and arm64 support depends on these. * aarch64/for-next/debug-virtual: drivers: firmware: psci: Use __pa_symbol for kernel symbol mm/usercopy: Switch to using lm_alias mm/kasan: Switch to using __pa_symbol and lm_alias kexec: Switch to __pa_symbol mm: Introduce lm_alias mm/cma: Cleanup highmem check lib/Kconfig.debug: Add ARCH_HAS_DEBUG_VIRTUAL
| | * | | | kexec: Switch to __pa_symbolLaura Abbott2017-01-111-1/+1
| | | |/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __pa_symbol is the correct api to get the physical address of kernel symbols. Switch to it to allow for better debug checking. Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
* | | | | Merge tag 'powerpc-4.11-1' of ↵Linus Torvalds2017-02-221-0/+6
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Highlights include: - Support for direct mapped LPC on POWER9, giving Linux direct access to devices that may be on there such as a UART. - Memory hotplug support for the Power9 Radix MMU. - Add new AUX vectors describing the processor's cache geometry, to be used by glibc. - The ability for a guest to ask the hypervisor to resize the guest's hash table, and in addition support for doing so automatically when memory is hotplugged into/out-of the guest. This allows the hash table to be sized based on the current memory usage of the guest, rather than the maximum possible memory usage. - Implementation of optprobes (kprobe optimisation) for powerpc. In addition there's the topic branch shared with the KVM tree, which includes support for guests to use the Radix MMU on Power9. Thanks to: Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T, Anton Blanchard, Benjamin Herrenschmidt, Chris Packham, Daniel Axtens, Daniel Borkmann, David Gibson, Finn Thain, Gautham R. Shenoy, Gavin Shan, Greg Kurz, Joel Stanley, John Allen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Paul Mackerras, Ravi Bangoria, Reza Arbab, Shailendra Singh, Vaibhav Jain, Wei Yongjun" * tag 'powerpc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (129 commits) powerpc/mm/radix: Skip ptesync in pte update helpers powerpc/mm/radix: Use ptep_get_and_clear_full when clearing pte for full mm powerpc/mm/radix: Update pte update sequence for pte clear case powerpc/mm: Update PROTFAULT handling in the page fault path powerpc/xmon: Fix data-breakpoint powerpc/mm: Fix build break with BOOK3S_64=n and MEMORY_HOTPLUG=y powerpc/mm: Fix build break when CMA=n && SPAPR_TCE_IOMMU=y powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n powerpc/pseries: Fix typo in parameter description powerpc/kprobes: Remove kprobe_exceptions_notify() kprobes: Introduce weak variant of kprobe_exceptions_notify() powerpc/ftrace: Fix confusing help text for DISABLE_MPROFILE_KERNEL powerpc/powernv: Fix opal_exit tracepoint opcode powerpc: Add a prototype for mcount() so it can be versioned powerpc: Drop GPL from of_node_to_nid() export to match other arches powerpc/kprobes: Optimize kprobe in kretprobe_trampoline() powerpc/kprobes: Implement Optprobes powerpc/kprobes: Fixes for kprobe_lookup_name() on BE powerpc: Add helper to check if offset is within relative branch range powerpc/bpf: Introduce __PPC_SH64() ...
| * | | | | kprobes: Introduce weak variant of kprobe_exceptions_notify()Naveen N. Rao2017-02-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kprobe_exceptions_notify() is not used on some of the architectures such as arm[64] and powerpc anymore. Introduce a weak variant for such architectures. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2017-02-2215-151/+1158
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: "Highlights: 1) Support TX_RING in AF_PACKET TPACKET_V3 mode, from Sowmini Varadhan. 2) Simplify classifier state on sk_buff in order to shrink it a bit. From Willem de Bruijn. 3) Introduce SIPHASH and it's usage for secure sequence numbers and syncookies. From Jason A. Donenfeld. 4) Reduce CPU usage for ICMP replies we are going to limit or suppress, from Jesper Dangaard Brouer. 5) Introduce Shared Memory Communications socket layer, from Ursula Braun. 6) Add RACK loss detection and allow it to actually trigger fast recovery instead of just assisting after other algorithms have triggered it. From Yuchung Cheng. 7) Add xmit_more and BQL support to mvneta driver, from Simon Guinot. 8) skb_cow_data avoidance in esp4 and esp6, from Steffen Klassert. 9) Export MPLS packet stats via netlink, from Robert Shearman. 10) Significantly improve inet port bind conflict handling, especially when an application is restarted and changes it's setting of reuseport. From Josef Bacik. 11) Implement TX batching in vhost_net, from Jason Wang. 12) Extend the dummy device so that VF (virtual function) features, such as configuration, can be more easily tested. From Phil Sutter. 13) Avoid two atomic ops per page on x86 in bnx2x driver, from Eric Dumazet. 14) Add new bpf MAP, implementing a longest prefix match trie. From Daniel Mack. 15) Packet sample offloading support in mlxsw driver, from Yotam Gigi. 16) Add new aquantia driver, from David VomLehn. 17) Add bpf tracepoints, from Daniel Borkmann. 18) Add support for port mirroring to b53 and bcm_sf2 drivers, from Florian Fainelli. 19) Remove custom busy polling in many drivers, it is done in the core networking since 4.5 times. From Eric Dumazet. 20) Support XDP adjust_head in virtio_net, from John Fastabend. 21) Fix several major holes in neighbour entry confirmation, from Julian Anastasov. 22) Add XDP support to bnxt_en driver, from Michael Chan. 23) VXLAN offloads for enic driver, from Govindarajulu Varadarajan. 24) Add IPVTAP driver (IP-VLAN based tap driver) from Sainath Grandhi. 25) Support GRO in IPSEC protocols, from Steffen Klassert" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1764 commits) Revert "ath10k: Search SMBIOS for OEM board file extension" net: socket: fix recvmmsg not returning error from sock_error bnxt_en: use eth_hw_addr_random() bpf: fix unlocking of jited image when module ronx not set arch: add ARCH_HAS_SET_MEMORY config net: napi_watchdog() can use napi_schedule_irqoff() tcp: Revert "tcp: tcp_probe: use spin_lock_bh()" net/hsr: use eth_hw_addr_random() net: mvpp2: enable building on 64-bit platforms net: mvpp2: switch to build_skb() in the RX path net: mvpp2: simplify MVPP2_PRS_RI_* definitions net: mvpp2: fix indentation of MVPP2_EXT_GLOBAL_CTRL_DEFAULT net: mvpp2: remove unused register definitions net: mvpp2: simplify mvpp2_bm_bufs_add() net: mvpp2: drop useless fields in mvpp2_bm_pool and related code net: mvpp2: remove unused 'tx_skb' field of 'struct mvpp2_tx_queue' net: mvpp2: release reference to txq_cpu[] entry after unmapping net: mvpp2: handle too large value in mvpp2_rx_time_coal_set() net: mvpp2: handle too large value handling in mvpp2_rx_pkts_coal_set() net: mvpp2: remove useless arguments in mvpp2_rx_{pkts, time}_coal_set ...
| * \ \ \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-196-26/+13
| |\ \ \ \ \ \
| * | | | | | | bpf: make jited programs visible in tracesDaniel Borkmann2017-02-174-13/+282
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Long standing issue with JITed programs is that stack traces from function tracing check whether a given address is kernel code through {__,}kernel_text_address(), which checks for code in core kernel, modules and dynamically allocated ftrace trampolines. But what is still missing is BPF JITed programs (interpreted programs are not an issue as __bpf_prog_run() will be attributed to them), thus when a stack trace is triggered, the code walking the stack won't see any of the JITed ones. The same for address correlation done from user space via reading /proc/kallsyms. This is read by tools like perf, but the latter is also useful for permanent live tracing with eBPF itself in combination with stack maps when other eBPF types are part of the callchain. See offwaketime example on dumping stack from a map. This work tries to tackle that issue by making the addresses and symbols known to the kernel. The lookup from *kernel_text_address() is implemented through a latched RB tree that can be read under RCU in fast-path that is also shared for symbol/size/offset lookup for a specific given address in kallsyms. The slow-path iteration through all symbols in the seq file done via RCU list, which holds a tiny fraction of all exported ksyms, usually below 0.1 percent. Function symbols are exported as bpf_prog_<tag>, in order to aide debugging and attribution. This facility is currently enabled for root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening is active in any mode. The rationale behind this is that still a lot of systems ship with world read permissions on kallsyms thus addresses should not get suddenly exposed for them. If that situation gets much better in future, we always have the option to change the default on this. Likewise, unprivileged programs are not allowed to add entries there either, but that is less of a concern as most such programs types relevant in this context are for root-only anyway. If enabled, call graphs and stack traces will then show a correct attribution; one example is illustrated below, where the trace is now visible in tooling such as perf script --kallsyms=/proc/kallsyms and friends. Before: 7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so) After: 7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux) [...] 7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux) 7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux) f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | bpf: remove stubs for cBPF from arch codeDaniel Borkmann2017-02-171-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the dummy bpf_jit_compile() stubs for eBPF JITs and make that a single __weak function in the core that can be overridden similarly to the eBPF one. Also remove stale pr_err() mentions of bpf_jit_compile. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | bpf: mark all registered map/prog types as __ro_after_initDaniel Borkmann2017-02-175-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All map types and prog types are registered to the BPF core through bpf_register_map_type() and bpf_register_prog_type() during init and remain unchanged thereafter. As by design we don't (and never will) have any pluggable code that can register to that at any later point in time, lets mark all the existing bpf_{map,prog}_type_list objects in the tree as __ro_after_init, so they can be moved to read-only section from then onwards. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-176-40/+90
| |\ \ \ \ \ \ \
| * | | | | | | | bpf: reduce compiler warnings by adding fallthrough commentsAlexander Alemayhu2017-02-141-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes the following warnings: kernel/bpf/verifier.c: In function ‘may_access_direct_pkt_data’: kernel/bpf/verifier.c:702:6: warning: this statement may fall through [-Wimplicit-fallthrough=] if (t == BPF_WRITE) ^ kernel/bpf/verifier.c:704:2: note: here case BPF_PROG_TYPE_SCHED_CLS: ^~~~ kernel/bpf/verifier.c: In function ‘reg_set_min_max_inv’: kernel/bpf/verifier.c:2057:23: warning: this statement may fall through [-Wimplicit-fallthrough=] true_reg->min_value = 0; ~~~~~~~~~~~~~~~~~~~~^~~ kernel/bpf/verifier.c:2058:2: note: here case BPF_JSGT: ^~~~ kernel/bpf/verifier.c:2068:23: warning: this statement may fall through [-Wimplicit-fallthrough=] true_reg->min_value = 0; ~~~~~~~~~~~~~~~~~~~~^~~ kernel/bpf/verifier.c:2069:2: note: here case BPF_JSGE: ^~~~ kernel/bpf/verifier.c: In function ‘reg_set_min_max’: kernel/bpf/verifier.c:2009:24: warning: this statement may fall through [-Wimplicit-fallthrough=] false_reg->min_value = 0; ~~~~~~~~~~~~~~~~~~~~~^~~ kernel/bpf/verifier.c:2010:2: note: here case BPF_JSGT: ^~~~ kernel/bpf/verifier.c:2019:24: warning: this statement may fall through [-Wimplicit-fallthrough=] false_reg->min_value = 0; ~~~~~~~~~~~~~~~~~~~~~^~~ kernel/bpf/verifier.c:2020:2: note: here case BPF_JSGE: ^~~~ Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Alexander Alemayhu <alexander@alemayhu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-111-2/+1
| |\ \ \ \ \ \ \ \
| * | | | | | | | | bpf, lpm: fix overflows in trie_alloc checksDaniel Borkmann2017-02-081-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cap the maximum (total) value size and bail out if larger than KMALLOC_MAX_SIZE as otherwise it doesn't make any sense to proceed further, since we're guaranteed to fail to allocate elements anyway in lpm_trie_node_alloc(); likleyhood of failure is still high for large values, though, similarly as with htab case in non-prealloc. Next, make sure that cost vars are really u64 instead of size_t, so that we don't overflow on 32 bit and charge only tiny map.pages against memlock while allowing huge max_entries; cap also the max cost like we do with other map types. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-02-074-66/+102
| |\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The conflict was an interaction between a bug fix in the netvsc driver in 'net' and an optimization of the RX path in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>