* ceph: decode interval_sets for delegated inos  (Jeff Layton, 2020-03-30; 2 files, -10/+121)

  Starting in Octopus, the MDS will hand out caps that allow the client to do asynchronous file creates under certain conditions. As part of that, the MDS will delegate ranges of inode numbers to the client.

  Add the infrastructure to decode these ranges, and stuff them into an xarray for later consumption by the async creation code.

  Because the xarray code currently only handles unsigned long indexes, and those are 32-bits on 32-bit arches, we only enable the decoding when running on a 64-bit arch.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
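  A minimal sketch of the idea (helper name hypothetical; the real decode path lives in this patch's fs/ceph changes): each delegated inode number becomes an xarray entry that the async-create path can later claim.

      /* Hypothetical helper: record a delegated range [start, start + len). */
      static int ceph_insert_delegated_range(struct xarray *delegated_inos,
                                             u64 start, u64 len)
      {
              u64 ino;

              /* xarray indexes are unsigned long, hence the 64-bit-only build. */
              for (ino = start; ino < start + len; ino++) {
                      int err = xa_insert(delegated_inos, ino, xa_mk_value(1),
                                          GFP_KERNEL);

                      if (err && err != -EBUSY)
                              return err;
              }
              return 0;
      }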
* ceph: make ceph_fill_inode non-static  (Jeff Layton, 2020-03-30; 2 files, -23/+32)

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: perform asynchronous unlink if we have sufficient caps  (Jeff Layton, 2020-03-30; 3 files, -5/+128)

  The MDS is getting a new lock-caching facility that will allow it to cache the necessary locks to allow asynchronous directory operations. Since the CEPH_CAP_FILE_* caps are currently unused on directories, we can repurpose those bits for this purpose.

  When performing an unlink, if we have Fx on the parent directory, and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being removed is the primary link, then we can fire off an unlink request immediately and don't need to wait on the reply before returning. In that situation, just fix up the dcache and link count and return immediately after issuing the call to the MDS.

  This does mean that we need to hold an extra reference to the inode being unlinked, and extra references to the caps to avoid races. Those references are put and error handling is done in the r_callback routine. If the operation ends up failing, then set a writeback error on the directory inode, and on the inode itself, that can be fetched later by an fsync on the dir.

  The behavior of dir caps is slightly different from caps on normal files. Because these are just considered an optimization, if the session is reconnected, we will not automatically reclaim them. They are instead considered lost until we do another synchronous op in the parent directory.

  Async dirops are enabled via the "nowsync" mount option, which is patterned after the xfs "wsync" mount option. For now, the default is "wsync", but eventually we may flip that.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
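  A rough sketch of the error-latching step described above, assuming the standard writeback-error hook is used (the exact fields the patch touches may differ):

      /* In the r_callback routine: latch the failure against both inodes so a
       * later fsync() on either one reports it. */
      if (result) {
              mapping_set_error(req->r_parent->i_mapping, result);
              mapping_set_error(d_inode(req->r_dentry)->i_mapping, result);
      }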
* ceph: don't take refs to want mask unless we have all bits  (Yan, Zheng, 2020-03-30; 1 file, -1/+4)

  If we don't have all of the cap bits for the want mask in try_get_cap_refs, then just take refs on the need bits.

  Signed-off-by: "Yan, Zheng" <ukernel@gmail.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: cap tracking for async directory operations  (Jeff Layton, 2020-03-30; 4 files, -14/+56)

  Track and correctly handle directory caps for asynchronous operations. Add aliases for Frc caps that we now designate as Dcu caps (when dealing with directories).

  Unlike file caps, we don't reclaim these when the session goes away, and instead preemptively release them. In-flight async dirops are instead handled during the reconnect phase. The client needs to re-do a synchronous operation in order to re-get directory caps.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: make __take_cap_refs non-static  (Jeff Layton, 2020-03-30; 2 files, -6/+8)

  Rename it to ceph_take_cap_refs and make it available to other files. Also replace a comment with a lockdep assertion.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add infrastructure for waiting for async create to complete  (Jeff Layton, 2020-03-30; 5 files, -4/+42)

  When we issue an async create, we must ensure that any later on-the-wire requests involving it wait for the create reply.

  Expand i_ceph_flags to be an unsigned long, and add a new bit that MDS requests can wait on. If the bit is set in the inode when sending caps, then don't send it and just return that it has been delayed.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
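  The general shape of such a wait, with a hypothetical bit number (the real definition belongs in fs/ceph/super.h; i_ceph_flags must be an unsigned long for the bit helpers to apply):

      #define CEPH_ASYNC_CREATE_BIT  12     /* hypothetical bit index */

      /* Sleep until the async create reply clears the bit. */
      static int ceph_wait_async_create(struct ceph_inode_info *ci)
      {
              return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
                                 TASK_INTERRUPTIBLE);
      }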
* ceph: track primary dentry link  (Jeff Layton, 2020-03-30; 4 files, -1/+12)

  Newer versions of the MDS will flag a dentry as "primary". In later patches, we'll need to consult this info, so track it in di->flags.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add flag to designate that a request is asynchronous  (Jeff Layton, 2020-03-30; 4 files, -2/+20)

  ...and ensure that such requests are never queued. The MDS has need to know that a request is asynchronous so add flags and proper infrastructure for that.

  Also, delegated inode numbers and directory caps are associated with the session, so ensure that async requests are always transmitted on the first attempt and are never queued to wait for session reestablishment.

  If it does end up looking like we'll need to queue the request, then have it return -EJUKEBOX so the caller can reattempt with a synchronous request.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
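  A sketch of the caller-side fallback this enables (the submit/do request helpers exist in fs/ceph/mds_client.c; the async flag name here is an assumption):

      err = ceph_mdsc_submit_request(mdsc, dir, req);     /* async attempt */
      if (err == -EJUKEBOX) {
              /* Would have had to queue: retry as a normal synchronous op. */
              clear_bit(CEPH_MDS_R_ASYNC, &req->r_req_flags);
              err = ceph_mdsc_do_request(mdsc, dir, req);
      }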
* ceph: more caps.c lockdep assertions  (Jeff Layton, 2020-03-30; 1 file, -0/+3)

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: clean up kick_flushing_inode_caps()  (Jeff Layton, 2020-03-30; 2 files, -12/+11)

  The last thing that this function does is release i_ceph_lock, so have the caller do that instead. Add a lockdep assertion to ensure that the function is always called with i_ceph_lock held.

  Change the prototype to take a ceph_inode_info pointer and drop the separate mdsc argument as we can get that from the session. While at it, make it non-static. We'll need this to kick any flushing caps once the create reply comes in.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
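  The assertion pattern referred to here (and in the lockdep patches above) is the standard one; a sketch using the post-cleanup signature described in the message:

      static void ceph_kick_flushing_inode_caps(struct ceph_mds_session *session,
                                                struct ceph_inode_info *ci)
      {
              /* Enforce the locking rule instead of documenting it in a comment. */
              lockdep_assert_held(&ci->i_ceph_lock);
              /* ... walk and re-send the flushing caps ... */
      }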
* libceph: directly skip to the end of redirect reply  (Ilya Dryomov, 2020-03-30; 1 file, -3/+0)

  Coverity complains about a double write to *p. Don't bother with osd_instructions and directly skip to the end of redirect reply.

  Reported-by: Colin Ian King <colin.king@canonical.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: simplify ceph_monc_handle_map()  (Ilya Dryomov, 2020-03-30; 1 file, -4/+4)

  ceph_monc_handle_map() confuses static checkers which report a false use-after-free on monc->monmap, missing that monc->monmap and client->monc.monmap is the same pointer. Use monc->monmap consistently and get rid of "old", which is redundant.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: return ETIMEDOUT errno to userland when request timed out  (Xiubo Li, 2020-03-30; 1 file, -2/+2)

  req->r_timeout is only used during mounting, so this error will be more accurate.

  URL: https://tracker.ceph.com/issues/44215
  Signed-off-by: Xiubo Li <xiubli@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: re-org copy_file_range and fix some error paths  (Luis Henriques, 2020-03-30; 1 file, -73/+100)

  This patch re-organizes copy_file_range, trying to fix a few issues in the error handling. Here's the summary:

  - Abort copy if initial do_splice_direct() returns fewer bytes than requested.
  - Move the 'size' initialization (with i_size_read()) further down in the code, after the initial call to do_splice_direct(). This avoids issues with a possibly stale value if a manual copy is done.
  - Move the object copy loop into a separate function. This makes it easier to handle errors (e.g., dirtying caps and updating the MDS metadata if only some objects have been copied before an error has occurred).
  - Added calls to ceph_oloc_destroy() to avoid leaking memory with src_oloc and dst_oloc.
  - After the object copy loop, the new file size to be reported to the MDS (if there's a file size change) is now the actual file size, and not the size after an eventual extra manual copy.
  - Added a few dout() to show the number of bytes copied in the two manual copies and in the object copy loop.

  Signed-off-by: Luis Henriques <lhenriques@suse.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: move to a dedicated slabcache for mds requests  (Jeff Layton, 2020-03-30; 3 files, -2/+12)

  On my machine (x86_64) this struct is 952 bytes, which gets rounded up to 1024 by kmalloc. Move this to a dedicated slabcache, so we can allocate them without the extra 72 bytes of overhead per allocation.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
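  A sketch of the usual dedicated-slabcache pattern (the cache variable name is illustrative; the patch may equally use the KMEM_CACHE() convenience macro):

      /* Created once at module init. */
      ceph_mds_request_cachep = kmem_cache_create("ceph_mds_request",
                                                  sizeof(struct ceph_mds_request),
                                                  0, 0, NULL);

      /* Requests now come from the dedicated cache instead of kmalloc. */
      req = kmem_cache_zalloc(ceph_mds_request_cachep, GFP_NOFS);
      /* ... */
      kmem_cache_free(ceph_mds_request_cachep, req);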
* ceph: reorganize fields in ceph_mds_request  (Jeff Layton, 2020-03-30; 1 file, -3/+3)

  This shrinks the struct size by 16 bytes.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: switch to page_mkwrite_check_truncate in ceph_page_mkwrite  (Andreas Gruenbacher, 2020-03-30; 1 file, -1/+1)

  Use the "page has been truncated" logic in page_mkwrite_check_truncate instead of reimplementing it here. Unlike the existing code, fail with -EFAULT / VM_FAULT_NOPAGE when page_offset(page) == size here as well, as should be expected.

  Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
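  For reference, the helper (from linux/pagemap.h) returns the number of bytes of the page that lie below i_size, or a negative error if the page was truncated away; a sketch of the call site:

      len = page_mkwrite_check_truncate(page, inode);
      if (len < 0) {
              /* Page no longer belongs to the file. */
              ret = VM_FAULT_NOPAGE;
              goto out;
      }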
* ceph: replace zero-length array with flexible-array member  (Gustavo A. R. Silva, 2020-03-30; 1 file, -1/+1)

  The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99:

      struct foo {
              int stuff;
              struct boo array[];
      };

  By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on.

  Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1]

  This issue was found with the help of Coccinelle.

  [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
  [2] https://github.com/KSPP/linux/issues/21
  [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

  Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: enable multiple blk-mq queues  (Hannes Reinecke, 2020-03-30; 1 file, -1/+1)

  Allocate one queue per CPU and get a performance boost from higher parallelism.

  Signed-off-by: Hannes Reinecke <hare@suse.de>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
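  Given the one-line diffstat, the change presumably boils down to the blk-mq tag-set sizing; a sketch of what such a one-liner looks like:

      /* One hardware queue per present CPU instead of a single queue. */
      rbd_dev->tag_set.nr_hw_queues = num_present_cpus();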
* rbd: embed image request in blk-mq pdu  (Ilya Dryomov, 2020-03-30; 1 file, -87/+51)

  Avoid making allocations for !IMG_REQ_CHILD image requests. Only IMG_REQ_CHILD image requests need to be freed now.

  Move the initial request checks to rbd_queue_rq(). Unfortunately we can't fill the image request and kick the state machine directly from rbd_queue_rq() because ->queue_rq() isn't allowed to block.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: acquire header_rwsem just once in rbd_queue_workfn()  (Ilya Dryomov, 2020-03-30; 1 file, -28/+31)

  Currently header_rwsem is acquired twice: once in rbd_dev_parent_get() when the image request is being created and then in rbd_queue_workfn() to capture mapping_size and snapc. Introduce rbd_img_capture_header() and move image request allocation so that header_rwsem can be acquired just once.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: get rid of img_request_layered_clear()  (Ilya Dryomov, 2020-03-30; 1 file, -8/+1)

  No need to clear IMG_REQ_LAYERED before destroying the request.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: kill img_request kref  (Hannes Reinecke, 2020-03-30; 1 file, -19/+5)

  The reference counter is never increased, so we can as well call rbd_img_request_destroy() directly and drop the kref.

  Signed-off-by: Hannes Reinecke <hare@suse.de>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: check if file lock exists before sending unlock request  (Yan, Zheng, 2020-03-30; 1 file, -2/+29)

  When a process exits, the kernel closes its files. locks_remove_file() is called to remove file locks on these files. locks_remove_file() tries unlocking files even if there is no file lock.

  Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix description of some mount options  (Xiubo Li, 2020-03-30; 1 file, -3/+3)

  Based on the latest code, the default value for wsize/rsize is 64MB and the default value for the mount_timeout is 60 seconds.

  Signed-off-by: Xiubo Li <xiubli@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: move ceph_osdc_{read,write}pages to ceph.ko  (Xiubo Li, 2020-03-30; 3 files, -98/+84)

  Since these helpers are only used by ceph.ko, move them there and rename them with _sync_ qualifiers.

  Signed-off-by: Xiubo Li <xiubli@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: don't ClearPageChecked in ceph_invalidatepage()  (Jeff Layton, 2020-03-30; 1 file, -2/+0)

  CephFS doesn't set this bit to begin with, so there should be no need to clear it.

  Reported-by: David Howells <dhowells@redhat.com>
  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* rbd: remove barriers from img_request_layered_{set,clear,test}()  (Ilya Dryomov, 2020-03-30; 1 file, -3/+0)

  IMG_REQ_LAYERED is set in rbd_img_request_create(), and tested and cleared in rbd_img_request_destroy() when the image request is about to be destroyed. The barriers are unnecessary.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: drop CEPH_DEFINE_SHOW_FUNC  (Ilya Dryomov, 2020-03-30; 3 files, -32/+18)

  Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.

  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
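  For reference, the generic macro only needs a *_show() routine and generates the corresponding open routine and file_operations; a sketch with an illustrative name:

      static int monc_show(struct seq_file *s, void *p)
      {
              /* ... dump monitor-client state into the seq_file ... */
              return 0;
      }
      DEFINE_SHOW_ATTRIBUTE(monc);    /* provides monc_open() and monc_fops */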
* ceph: check inode type for CEPH_CAP_FILE_{CACHE,RD,REXTEND,LAZYIO}  (Yan, Zheng, 2020-03-30; 3 files, -24/+35)

  These bits will have new meaning for directory inodes.

  Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
  Reviewed-by: Jeff Layton <jlayton@kernel.org>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add refcounting for Fx caps  (Jeff Layton, 2020-03-30; 3 files, -1/+9)

  In future patches we'll be taking and relying on Fx caps. Add proper refcounting for them.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: register MDS request with dir inode from the start  (Jeff Layton, 2020-03-30; 1 file, -10/+6)

  When the unsafe reply to a request comes in, the request is put on the r_unsafe_dir inode's list. In future patches, we're going to need to wait on requests that may not have gotten an unsafe reply yet.

  Change __register_request to put the entry on the dir inode's list when the pointer is set in the request, and don't check the CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.

  The only place that uses this list today is the fsync codepath, and with the coming changes, we'll want to wait on all operations whether they have gotten an unsafe reply or not.

  Signed-off-by: Jeff Layton <jlayton@kernel.org>
  Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
  Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Linux 5.6  (tag: v5.6)  (Linus Torvalds, 2020-03-30; 1 file, -1/+1)
* Merge branch 'akpm' (patches from Andrew)  (Linus Torvalds, 2020-03-29; 7 files, -45/+82)

  Merge vm fixes from Andrew Morton: "5 fixes"

  * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
    mm/sparse: fix kernel crash with pfn_section_valid check
    mm: fork: fix kernel_stack memcg stats for various stack implementations
    hugetlb_cgroup: fix illegal access to memory
    drivers/base/memory.c: indicate all memory blocks as removable
    mm/swapfile.c: move inode_lock out of claim_swapfile
| * mm/sparse: fix kernel crash with pfn_section_valid check  (Aneesh Kumar K.V, 2020-03-29; 1 file, -0/+6)

  Fix the crash like this:

      BUG: Kernel NULL pointer dereference on read at 0x00000000
      Faulting instruction address: 0xc000000000c3447c
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
      CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
      ...
      NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
      LR [c000000000088354] vmemmap_free+0x144/0x320
      Call Trace:
        section_deactivate+0x220/0x240
        __remove_pages+0x118/0x170
        arch_remove_memory+0x3c/0x150
        memunmap_pages+0x1cc/0x2f0
        devm_action_release+0x30/0x50
        release_nodes+0x2f8/0x3e0
        device_release_driver_internal+0x168/0x270
        unbind_store+0x130/0x170
        drv_attr_store+0x44/0x60
        sysfs_kf_write+0x68/0x80
        kernfs_fop_write+0x100/0x290
        __vfs_write+0x3c/0x70
        vfs_write+0xcc/0x240
        ksys_write+0x7c/0x140
        system_call+0x5c/0x68

  The crash is due to a NULL dereference at test_bit(idx, ms->usage->subsection_map), because ms->usage = NULL in pfn_section_valid().

  With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after depopulate_section_mem(). This was done so that pfn_to_page() can work correctly with kernel config that disables SPARSEMEM_VMEMMAP. With that config pfn_to_page does

      __section_mem_map_addr(__sec) + __pfn;

  where

      static inline struct page *__section_mem_map_addr(struct mem_section *section)
      {
              unsigned long map = section->section_mem_map;
              map &= SECTION_MAP_MASK;
              return (struct page *)map;
      }

  Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is used to check the pfn validity (pfn_valid()). Since section_deactivate releases mem_section->usage if a section is fully deactivated, a pfn_valid() check after a subsection_deactivate causes a kernel crash.

      static inline int pfn_valid(unsigned long pfn)
      {
      ...
              return early_section(ms) || pfn_section_valid(ms, pfn);
      }

  where

      static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
      {
              int idx = subsection_map_index(pfn);

              return test_bit(idx, ms->usage->subsection_map);
      }

  Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is freed. For architectures like ppc64 where large pages are used for vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple sections. Hence before a vmemmap mapping page can be freed, the kernel needs to make sure there are no valid sections within that mapping. Clearing the section valid bit before depopulate_section_memap enables this.

  [aneesh.kumar@linux.ibm.com: add comment]
  Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
  Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
  Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
  Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
  Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
  Reviewed-by: Baoquan He <bhe@redhat.com>
  Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
  Cc: Michael Ellerman <mpe@ellerman.id.au>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: David Hildenbrand <david@redhat.com>
  Cc: Oscar Salvador <osalvador@suse.de>
  Cc: Mike Rapoport <rppt@linux.ibm.com>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm: fork: fix kernel_stack memcg stats for various stack implementations  (Roman Gushchin, 2020-03-29; 3 files, -2/+52)

  Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node().

  In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg. In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup.

  It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration).

  In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows to handle all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c.

  Note: This is a special version of the patch created for stable backports. It contains code from the following two patches:
    - mm: memcg/slab: introduce mem_cgroup_from_obj()
    - mm: fork: fix kernel_stack memcg stats for various stack implementations

  [guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
  Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
  Signed-off-by: Roman Gushchin <guro@fb.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Reviewed-by: Shakeel Butt <shakeelb@google.com>
  Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  Cc: Michal Hocko <mhocko@kernel.org>
  Cc: Bharata B Rao <bharata@linux.ibm.com>
  Cc: Shakeel Butt <shakeelb@google.com>
  Cc: <stable@vger.kernel.org>
  Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * hugetlb_cgroup: fix illegal access to memory  (Mina Almasry, 2020-03-29; 1 file, -2/+1)

  This appears to be a mistake in commit faced7e0806cf ("mm: hugetlb controller for cgroups v2").

  Essentially that commit does a hugetlb_cgroup_from_counter assuming that page_counter_try_charge has initialized counter. But if that has failed then it seems it will not initialize counter, so hugetlb_cgroup_from_counter(counter) ends up pointing to random memory, causing kasan to complain.

  The solution is to simply use 'h_cg', instead of hugetlb_cgroup_from_counter(counter), since that is a reference to the hugetlb_cgroup anyway. After this change kasan ceases to complain.

  Fixes: faced7e0806cf ("mm: hugetlb controller for cgroups v2")
  Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
  Signed-off-by: Mina Almasry <almasrymina@google.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
  Acked-by: Tejun Heo <tj@kernel.org>
  Cc: Mike Kravetz <mike.kravetz@oracle.com>
  Cc: David Rientjes <rientjes@google.com>
  Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * drivers/base/memory.c: indicate all memory blocks as removable  (David Hildenbrand, 2020-03-29; 1 file, -20/+3)

  We see multiple issues with the implementation/interface to compute whether a memory block can be offlined (exposed via /sys/devices/system/memory/memoryX/removable) and would like to simplify it (remove the implementation).

  1. It runs basically lockless. While this might be good for performance, we see possible races with memory offlining that will require at least some sort of locking to fix.

  2. Nowadays, more false positives are possible. No arch-specific checks are performed that validate if memory offlining will not be denied right away (and such check will require locking). For example, arm64 won't allow to offline any memory block that was added during boot - which will imply a very high error rate. Other archs have other constraints.

  3. The interface is inherently racy. E.g., if a memory block is detected to be removable (and was not a false positive at that time), there is still no guarantee that offlining will actually succeed. So any caller already has to deal with false positives.

  4. It is unclear which performance benefit this interface actually provides. The introducing commit 5c755e9fd813 ("memory-hotplug: add sysfs removable attribute for hotplug memory remove") mentioned "A user-level agent must be able to identify which sections of memory are likely to be removable before attempting the potentially expensive operation." However, no actual performance comparison was included.

  Known users:

  - lsmem: Will group memory blocks based on the "removable" property. [1]

  - chmem: Indirect user. It has a RANGE mode where one can specify removable ranges identified via lsmem to be offlined. However, it also has a "SIZE" mode, which allows a sysadmin to skip the manual "identify removable blocks" step. [2]

  - powerpc-utils: Uses the "removable" attribute to skip some memory blocks right away when trying to find some to offline+remove. However, with ballooning enabled, it already skips this information completely (because it once resulted in many false negatives). Therefore, the implementation can deal with false positives properly already. [3]

  According to Nathan Fontenot, DLPAR on powerpc is nowadays no longer driven from userspace via the drmgr command (powerpc-utils). Nowadays it's managed in the kernel - including onlining/offlining of memory blocks - triggered by drmgr writing to /sys/kernel/dlpar. So the affected legacy userspace handling is only active on old kernels. Only very old versions of drmgr on a new kernel (unlikely) might execute slower - totally acceptable.

  With CONFIG_MEMORY_HOTREMOVE, always indicating "removable" should not break any user space tool. We implement a very bad heuristic now. Without CONFIG_MEMORY_HOTREMOVE we cannot offline anything, so report "not removable" as before.

  Original discussion can be found in [4] ("[PATCH RFC v1] mm: is_mem_section_removable() overhaul"). Other users of is_mem_section_removable() will be removed next, so that we can remove is_mem_section_removable() completely.

  [1] http://man7.org/linux/man-pages/man1/lsmem.1.html
  [2] http://man7.org/linux/man-pages/man8/chmem.8.html
  [3] https://github.com/ibm-power-utilities/powerpc-utils
  [4] https://lkml.kernel.org/r/20200117105759.27905-1-david@redhat.com

  Also, this patch probably fixes a crash reported by Steve.
  http://lkml.kernel.org/r/CAPcyv4jpdaNvJ67SkjyUJLBnBnXXQv686BiVW042g03FUmWLXw@mail.gmail.com

  Reported-by: "Scargall, Steve" <steve.scargall@intel.com>
  Suggested-by: Michal Hocko <mhocko@kernel.org>
  Signed-off-by: David Hildenbrand <david@redhat.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Reviewed-by: Nathan Fontenot <ndfont@gmail.com>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Cc: "Rafael J. Wysocki" <rafael@kernel.org>
  Cc: Badari Pulavarty <pbadari@us.ibm.com>
  Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
  Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
  Cc: Karel Zak <kzak@redhat.com>
  Cc: <stable@vger.kernel.org>
  Link: http://lkml.kernel.org/r/20200128093542.6908-1-david@redhat.com
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * mm/swapfile.c: move inode_lock out of claim_swapfile  (Naohiro Aota, 2020-03-29; 1 file, -21/+20)

  claim_swapfile() currently keeps the inode locked when it is successful, or when the file is already a swapfile (with -EBUSY). And, in the other error cases, it does not lock the inode.

  This inconsistency of the lock state and return value is quite confusing and actually causing a bad unlock balance as below in the "bad_swap" section of __do_sys_swapon().

  This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE check out of claim_swapfile(). The inode is unlocked in the "bad_swap_unlock_inode" section, so that the inode is ensured to be unlocked at "bad_swap". Thus, error handling code after the locking now jumps to "bad_swap_unlock_inode" instead of "bad_swap".

      =====================================
      WARNING: bad unlock balance detected!
      5.5.0-rc7+ #176 Not tainted
      -------------------------------------
      swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
      but there are no more locks to release!

      other info that might help us debug this:
      no locks held by swapon/4294.

      stack backtrace:
      CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
      Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
      Call Trace:
        dump_stack+0xa1/0xea
        print_unlock_imbalance_bug.cold+0x114/0x123
        lock_release+0x562/0xed0
        up_write+0x2d/0x490
        __do_sys_swapon+0x94b/0x3550
        __x64_sys_swapon+0x54/0x80
        do_syscall_64+0xa4/0x4b0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x7f15da0a0dc7

  Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
  Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Tested-by: Qais Yousef <qais.yousef@arm.com>
  Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
  Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
  Cc: Christoph Hellwig <hch@infradead.org>
  Cc: <stable@vger.kernel.org>
  Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2020-03-29; 1 file, -2/+4)

  Pull timer fix from Thomas Gleixner:
   "A single fix for the Hyper-V clocksource driver to make sched clock actually return nanoseconds and not the virtual clock value which increments at 10e7 HZ (100ns)"

  * tag 'timers-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly
| * | clocksource/drivers/hyper-v: Make sched clock return nanoseconds correctly  (Yubo Xie, 2020-03-27; 1 file, -2/+4)

  The sched clock read functions return the HV clock (100ns granularity) without converting it to nanoseconds. Add the missing conversion.

  Fixes: bd00cd52d5be ("clocksource/drivers/hyperv: Add Hyper-V specific sched clock function")
  Signed-off-by: Yubo Xie <yuboxie@microsoft.com>
  Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
  Cc: stable@vger.kernel.org
  Link: https://lkml.kernel.org/r/20200327021159.31429-1-Tianyu.Lan@microsoft.com
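  The conversion itself is simple arithmetic; an illustration only, with a hypothetical helper name (the real read functions live in drivers/clocksource/hyperv_timer.c):

      /* The Hyper-V reference counter ticks at 10 MHz, i.e. 100 ns per tick. */
      static u64 hv_ticks_to_ns(u64 hv_ticks)
      {
              return hv_ticks * 100;
      }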
* | | Merge tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2020-03-29; 1 file, -2/+9)

  Pull irq fix from Thomas Gleixner:
   "A single bugfix to prevent reference leaks in irq affinity notifiers"

  * tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq: Fix reference leaks on irq affinity notifiers
| * | | genirq: Fix reference leaks on irq affinity notifiers  (Edward Cree, 2020-03-21; 1 file, -2/+9)

  The handling of notify->work did not properly maintain notify->kref in two cases:

  1) where the work was already scheduled, another irq_set_affinity_locked() would get the ref and (no-op-ly) schedule the work. Thus when irq_affinity_notify() ran, it would drop the original ref but not the additional one.

  2) when cancelling the (old) work in irq_set_affinity_notifier(), if there was outstanding work a ref had been got for it but was never put.

  Fix both by checking the return values of the work handling functions (schedule_work() for (1) and cancel_work_sync() for (2)) and put the extra ref if the return value indicates preexisting work.

  Fixes: cd7eab44e994 ("genirq: Add IRQ affinity notifiers")
  Fixes: 59c39840f5ab ("genirq: Prevent use-after-free and work list corruption")
  Signed-off-by: Edward Cree <ecree@solarflare.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Acked-by: Ben Hutchings <ben@decadent.org.uk>
  Link: https://lkml.kernel.org/r/24f5983f-2ab5-e83a-44ee-a45b5f9300f5@solarflare.com
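  A sketch of the two checks the message describes (struct irq_affinity_notify fields as declared in linux/interrupt.h):

      /* Case 1: schedule_work() returns false if the work was already queued,
       * so drop the reference taken for this call. */
      if (!schedule_work(&notify->work))
              kref_put(&notify->kref, notify->release);

      /* Case 2: cancel_work_sync() returns true if pending work was cancelled;
       * that work held a reference which would otherwise never be put. */
      if (old_notify && cancel_work_sync(&old_notify->work))
              kref_put(&old_notify->kref, old_notify->release);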
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds, 2020-03-29; 23 files, -80/+221)

  Pull networking fixes from David Miller:

  1) Fix memory leak in vti6, from Torsten Hilbrich.

  2) Fix double free in xfrm_policy_timer, from YueHaibing.

  3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from Johannes Berg.

  4) Wrong allocation failure check in qlcnic driver, from Xu Wang.

  5) Get ks8851-ml IO operations right, for real this time, from Marek Vasut.

  * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
    r8169: fix PHY driver check on platforms w/o module softdeps
    net: ks8851-ml: Fix IO operations, again
    mlxsw: spectrum_mr: Fix list iteration in error path
    qlcnic: Fix bad kzalloc null test
    mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
    mac80211: mark station unauthorized before key removal
    mac80211: Check port authorization in the ieee80211_tx_dequeue() case
    cfg80211: Do not warn on same channel at the end of CSA
    mac80211: drop data frames without key on encrypted links
    ieee80211: fix HE SPR size calculation
    nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
    xfrm: policy: Fix doulbe free in xfrm_policy_timer
    bpf: Explicitly memset some bpf info structures declared on the stack
    bpf: Explicitly memset the bpf_attr structure
    bpf: Sanitize the bpf_struct_ops tcp-cc name
    vti6: Fix memory leak of skb if input policy check fails
    esp: remove the skb from the chain when it's enqueued in cryptd_wq
    ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
    xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
    xfrm: fix uctx len check in verify_sec_ctx_len
    ...
| * \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller, 2020-03-28; 4 files, -20/+25)

  Daniel Borkmann says:

  ====================
  pull-request: bpf 2020-03-27

  The following pull-request contains BPF updates for your *net* tree.

  We've added 3 non-merge commits during the last 4 day(s) which contain a total of 4 files changed, 25 insertions(+), 20 deletions(-).

  The main changes are:

  1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid having to rely on compiler to do so. Issues have been noticed on some compilers with padding and other oddities where the request was then unexpectedly rejected, from Greg Kroah-Hartman.

  2) Sanitize the bpf_struct_ops TCP congestion control name in order to avoid problematic characters such as whitespaces, from Martin KaFai Lau.
  ====================

  Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | bpf: Explicitly memset some bpf info structures declared on the stack  (Greg Kroah-Hartman, 2020-03-20; 2 files, -3/+6)

  Trying to initialize a structure with "= {};" will not always clean out all padding locations in a structure. So be explicit and call memset to initialize everything for a number of bpf information structures that are then copied from userspace, sometimes from smaller memory locations than the size of the structure.

  Reported-by: Daniel Borkmann <daniel@iogearbox.net>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  Acked-by: Yonghong Song <yhs@fb.com>
  Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com
| | * | | bpf: Explicitly memset the bpf_attr structure  (Greg Kroah-Hartman, 2020-03-20; 1 file, -1/+2)

  For the bpf syscall, we are relying on the compiler to properly zero out the bpf_attr union that we copy userspace data into. Unfortunately that doesn't always work properly, padding and other oddities might not be correctly zeroed, and in some tests odd things have been found when the stack is pre-initialized to other values.

  Fix this by explicitly memsetting the structure to 0 before using it.

  Reported-by: Maciej Żenczykowski <maze@google.com>
  Reported-by: John Stultz <john.stultz@linaro.org>
  Reported-by: Alexander Potapenko <glider@google.com>
  Reported-by: Alistair Delva <adelva@google.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  Acked-by: Yonghong Song <yhs@fb.com>
  Link: https://android-review.googlesource.com/c/kernel/common/+/1235490
  Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
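  The pattern both bpf patches apply, sketched against the syscall entry (variable names illustrative):

      union bpf_attr attr;

      /* Zero every byte, including padding, before copying the (possibly
       * smaller) userspace blob into the union. */
      memset(&attr, 0, sizeof(attr));
      if (copy_from_user(&attr, uattr, size) != 0)
              return -EFAULT;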
| | * | | bpf: Sanitize the bpf_struct_ops tcp-cc name  (Martin KaFai Lau, 2020-03-17; 3 files, -16/+17)

  The bpf_struct_ops tcp-cc name should be sanitized in order to avoid problematic chars (e.g. whitespaces).

  This patch reuses the bpf_obj_name_cpy() for accepting the same set of characters in order to keep a consistent bpf programming experience. A "size" param is added. Also, the strlen is returned on success so that the caller (like the bpf_tcp_ca here) can error out on empty name.

  The existing callers of the bpf_obj_name_cpy() only need to change the testing statement to "if (err < 0)". For all these existing callers, the err will be overwritten later, so no extra change is needed for the new strlen return value.

  v3: - reverse xmas tree style
  v2: - Save the orig_src to avoid "end - size" (Andrii)

  Fixes: 0baf26b0fcd7 ("bpf: tcp: Support tcp_congestion_ops in bpf")
  Signed-off-by: Martin KaFai Lau <kafai@fb.com>
  Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  Acked-by: Andrii Nakryiko <andriin@fb.com>
  Link: https://lore.kernel.org/bpf/20200314010209.1131542-1-kafai@fb.com
| * | | | r8169: fix PHY driver check on platforms w/o module softdeps  (Heiner Kallweit, 2020-03-27; 1 file, -9/+7)

  On Android/x86 the module loading infrastructure can't deal with softdeps. Therefore the check for presence of the Realtek PHY driver module fails. mdiobus_register() will try to load the PHY driver module, therefore move the check to after this call and explicitly check that a dedicated PHY driver is bound to the PHY device.

  Fixes: f32593773549 ("r8169: check that Realtek PHY driver module is loaded")
  Reported-by: Chih-Wei Huang <cwhuang@android-x86.org>
  Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>