linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	mm: fix locking order in mm_take_all_locks()	Kirill A. Shutemov	2016-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dmitry Vyukov has reported[1] possible deadlock (triggered by his syzkaller fuzzer): Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&hugetlbfs_i_mmap_rwsem_key); lock(&mapping->i_mmap_rwsem); lock(&hugetlbfs_i_mmap_rwsem_key); lock(&mapping->i_mmap_rwsem); Both traces points to mm_take_all_locks() as a source of the problem. It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka mapping->i_mmap_rwsem for hugetlb mapping) vs. i_mmap_rwsem. huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key and allocator can take i_mmap_rwsem if it hit reclaim. So we need to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from rest of VMAs. The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key. [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	cgroup, memcg, writeback: drop spurious rcu locking around ↵	Tejun Heo	2016-01-16	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mem_cgroup_css_from_page() In earlier versions, mem_cgroup_css_from_page() could return non-root css on a legacy hierarchy which can go away and required rcu locking; however, the eventual version simply returns the root cgroup if memcg is on a legacy hierarchy and thus doesn't need rcu locking around or in it. Remove spurious rcu lockings. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: re-enable dax pmd mappings	Dan Williams	2016-01-16	2	-7/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	Now that the get_user_pages() path knows how to handle dax-pmd mappings, remove the protections that disabled dax-pmd support. Tests available from github.com/pmem/ndctl: make TESTS="lib/test-dax.sh lib/test-mmap.sh" check Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: provide diagnostics for pmd mapping failures	Dan Williams	2016-01-16	1	-8/+57
\| \| \| \| \| \| \| \| \| \| \| \|	There is a wide gamut of conditions that can trigger the dax pmd path to fallback to pte mappings. Ideally we'd have a syscall interface to determine mapping characteristics after the fact. In the meantime provide debug messages. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm, dax: convert vmf_insert_pfn_pmd() to pfn_t	Dan Williams	2016-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Similar to the conversion of vm_insert_mixed() use pfn_t in the vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the pfn is backed by a devm_memremap_pages() mapping. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm, dax, gpu: convert vm_insert_mixed to pfn_t	Dan Williams	2016-01-16	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers _PAGE_DEVMAP to be set in the resulting pte. There are no functional changes to the gpu drivers as a result of this conversion. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave@sr71.net> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm, dax, pmem: introduce pfn_t	Dan Williams	2016-01-16	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For the purpose of communicating the optional presence of a 'struct page' for the pfn returned from ->direct_access(), introduce a type that encapsulates a page-frame-number plus flags. These flags contain the historical "page_link" encoding for a scatterlist entry, but can also denote "device memory". Where "device memory" is a set of pfns that are not part of the kernel's linear mapping by default, but are accessed via the same memory controller as ram. The motivation for this new type is large capacity persistent memory that needs struct page entries in the 'memmap' to support 3rd party DMA (i.e. O_DIRECT I/O with a persistent memory source/target). However, we also need it in support of maintaining a list of mapped inodes which need to be unmapped at driver teardown or freeze_bdev() time. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Hansen <dave@sr71.net> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: Split pmd map when fallback on COW	Toshi Kani	2016-01-16	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An infinite loop of PMD faults was observed when attempted to mlock() a private read-only PMD mmap'd range of a DAX file. __dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when falling back to PTE on COW. However, __handle_mm_fault() returns without falling back to handle_pte_fault() because a PMD map is present in this case. Change __dax_pmd_fault() to split the PMD map, if present, before returning with VM_FAULT_FALLBACK. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()	Dan Williams	2016-01-16	2	-86/+135
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The DAX implementation needs to protect new calls to ->direct_access() and usage of its return value against the driver for the underlying block device being disabled. Use blk_queue_enter()/blk_queue_exit() to hold off blk_cleanup_queue() from proceeding, or otherwise fail new mapping requests if the request_queue is being torn down. This also introduces blk_dax_ctl to simplify the interface from fs/dax.c through dax_map_atomic() to bdev_direct_access(). [willy@linux.intel.com: fix read() of a hole] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: guarantee page aligned results from bdev_direct_access()	Dan Williams	2016-01-16	2	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	If a ->direct_access() implementation ever returns a map count less than PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies error checking in upper layers. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	dax: increase granularity of dax_clear_blocks() operations	Dan Williams	2016-01-16	1	-14/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	dax_clear_blocks is currently performing a cond_resched() after every PAGE_SIZE memset. We need not check so frequently, for example md-raid only calls cond_resched() at stripe granularity. Also, in preparation for introducing a dax_map_atomic() operation that temporarily pins a dax mapping move the call to cond_resched() to the outer loop. The worst case latency between calls to cond_resched() after this change is 500us the average latency is 133us. This is up from a 10us max and 4us average. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pmem, dax: clean up clear_pmem()	Dan Williams	2016-01-16	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To date, we have implemented two I/O usage models for persistent memory, PMEM (a persistent "ram disk") and DAX (mmap persistent memory into userspace). This series adds a third, DAX-GUP, that allows DAX mappings to be the target of direct-i/o. It allows userspace to coordinate DMA/RDMA from/to persistent memory. The implementation leverages the ZONE_DEVICE mm-zone that went into 4.3-rc1 (also discussed at kernel summit) to flag pages that are owned and dynamically mapped by a device driver. The pmem driver, after mapping a persistent memory range into the system memmap via devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus page-backed pmem-pfns via flags in the new pfn_t type. The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the resulting pte(s) inserted into the process page tables with a new _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys off _PAGE_DEVMAP to pin the device hosting the page range active. Finally, get_page() and put_page() are modified to take references against the device driver established page mapping. Finally, this need for "struct page" for persistent memory requires memory capacity to store the memmap array. Given the memmap array for a large pool of persistent may exhaust available DRAM introduce a mechanism to allocate the memmap from persistent memory. The new "struct vmem_altmap *" parameter to devm_memremap_pages() enables arch_add_memory() to use reserved pmem capacity rather than the page allocator. This patch (of 25): Both __dax_pmd_fault, and clear_pmem() were taking special steps to clear memory a page at a time to take advantage of non-temporal clear_page() implementations. However, x86_64 does not use non-temporal instructions for clear_page(), and arch_clear_pmem() was always incurring the cost of __arch_wb_cache_pmem(). Clean up the assumption that doing clear_pmem() a page at a time is more performant. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: David Airlie <airlied@linux.ie> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: differentiate page_mapped() from page_mapcount() for compound pages	Kirill A. Shutemov	2016-01-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Let's define page_mapped() to be true for compound pages if any sub-pages of the compound page is mapped (with PMD or PTE). On other hand page_mapcount() return mapcount for this particular small page. This will make cases like page_get_anon_vma() behave correctly once we allow huge pages to be mapped with PTE. Most users outside core-mm should use page_mapcount() instead of page_mapped(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm, thp: remove infrastructure for handling splitting PMDs	Kirill A. Shutemov	2016-01-16	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With new refcounting we don't need to mark PMDs splitting. Let's drop code to handle this. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm, proc: adjust PSS calculation	Kirill A. Shutemov	2016-01-16	1	-16/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal of this patchset is to make refcounting on THP pages cheaper with simpler semantics and allow the same THP compound page to be mapped with PMD and PTEs. This is required to get reasonable THP-pagecache implementation. With the new refcounting design it's much easier to protect against split_huge_page(): simple reference on a page will make you the deal. It makes gup_fast() implementation simpler and doesn't require special-case in futex code to handle tail THP pages. It should improve THP utilization over the system since splitting THP in one process doesn't necessary lead to splitting the page in all other processes have the page mapped. The patchset drastically lower complexity of get_page()/put_page() codepaths. I encourage people look on this code before-and-after to justify time budget on reviewing this patchset. This patch (of 37): With new refcounting all subpages of the compound page are not necessary have the same mapcount. We need to take into account mapcount of every sub-page. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	page-flags: define PG_locked behavior on compound pages	Kirill A. Shutemov	2016-01-16	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	lock_page() must operate on the whole compound page. It doesn't make much sense to lock part of compound page. Change code to use head page's PG_locked, if tail page is passed. This patch also gets rid of custom helper functions -- __set_page_locked() and __clear_page_locked(). They are replaced with helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Tail pages to these helper would trigger VM_BUG_ON(). SLUB uses PG_locked as a bit spin locked. IIUC, tail pages should never appear there. VM_BUG_ON() is added to make sure that this assumption is correct. [akpm@linux-foundation.org: fix fs/cifs/file.c] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux	Linus Torvalds	2016-01-15	12	-47/+310
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull nfsd updates from Bruce Fields: "Smaller bugfixes and cleanup, including a fix for a failures of kerberized NFSv4.1 mounts, and Scott Mayhew's work addressing ACK storms that can affect some high-availability NFS setups" * tag 'nfsd-4.5' of git://linux-nfs.org/~bfields/linux: nfsd: add new io class tracepoint nfsd: give up on CB_LAYOUTRECALLs after two lease periods nfsd: Fix nfsd leaks sunrpc module references lockd: constify nlmsvc_binding structure lockd: use to_delayed_work nfsd: use to_delayed_work Revert "svcrdma: Do not send XDR roundup bytes for a write chunk" lockd: Register callbacks on the inetaddr_chain and inet6addr_chain nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain sunrpc: Add a function to close temporary transports immediately nfsd: don't base cl_cb_status on stale information nfsd4: fix gss-proxy 4.1 mounts for some AD principals nfsd: fix unlikely NULL deref in mach_creds_match nfsd: minor consolidation of mach_cred handling code nfsd: helper for dup of possibly NULL string svcrpc: move some initialization to common code nfsd: fix a warning message nfsd: constify nfsd4_callback_ops structure nfsd: recover: constify nfsd4_client_tracking_ops structures svcrdma: Do not send XDR roundup bytes for a write chunk
\| *	nfsd: add new io class tracepoint	Jeff Layton	2016-01-14	3	-0/+79
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add some new tracepoints in the nfsd read/write codepaths. The idea is that this will give us the ability to measure how long each phase of a read or write operation takes. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: give up on CB_LAYOUTRECALLs after two lease periods	Jeff Layton	2016-01-08	1	-10/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Have the CB_LAYOUTRECALL code treat NFS4_OK and NFS4ERR_DELAY returns equivalently. Change the code to periodically resend CB_LAYOUTRECALLS until the ls_layouts list is empty or the client returns a different error code. If we go for two lease periods without the list being emptied or the client sending a hard error, then we give up and clean out the list anyway. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: Fix nfsd leaks sunrpc module references	Kinglong Mee	2016-01-07	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Stefan Hajnoczi reports, nfsd leaks 3 references to the sunrpc module here: # echo -n "asdf 1234" >/proc/fs/nfsd/portlist bash: echo: write error: Protocol not supported Now stop nfsd and try unloading the kernel modules: # systemctl stop nfs-server # systemctl stop nfs # systemctl stop proc-fs-nfsd.mount # systemctl stop var-lib-nfs-rpc_pipefs.mount # rmmod nfsd # rmmod nfs_acl # rmmod lockd # rmmod auth_rpcgss # rmmod sunrpc rmmod: ERROR: Module sunrpc is in use # lsmod \| grep rpc sunrpc 315392 3 It is caused by nfsd don't cleanup rpcb program for nfsd when destroying svc service after creating xprt fail. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	lockd: constify nlmsvc_binding structure	Julia Lawall	2016-01-07	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nlmsvc_binding structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	lockd: use to_delayed_work	Geliang Tang	2016-01-07	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: use to_delayed_work	Geliang Tang	2016-01-07	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use to_delayed_work() instead of open-coding it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	lockd: Register callbacks on the inetaddr_chain and inet6addr_chain	Scott Mayhew	2015-12-23	1	-2/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Register callbacks on inetaddr_chain and inet6addr_chain to trigger cleanup of lockd transport sockets when an ip address is deleted. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain	Scott Mayhew	2015-12-23	1	-0/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Register callbacks on inetaddr_chain and inet6addr_chain to trigger cleanup of nfsd transport sockets when an ip address is deleted. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: don't base cl_cb_status on stale information	J. Bruce Fields	2015-12-23	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The rpc client we use to send callbacks may change occasionally. (In the 4.0 case, the client can use setclientid/setclientid_confirm to update the callback parameters. In the 4.1+ case, sessions and connections can come and go.) The update is done from the callback thread by nfsd4_process_cb_update, which shuts down the old rpc client and then creates a new one. The client shutdown kills any ongoing rpc calls. There won't be any new ones till the new one's created and the callback thread moves on. When an rpc encounters a problem that may suggest callback rpc's aren't working any longer, it normally sets NFSD4_CB_DOWN, so the server can tell the client something's wrong. But if the rpc notices CB_UPDATE is set, then the failure may just be a normal result of shutting down the callback client. Or it could just be a coincidence, but in any case, it means we're runing with the old about-to-be-discarded client, so the failure's not interesting. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd4: fix gss-proxy 4.1 mounts for some AD principals	J. Bruce Fields	2015-11-24	1	-1/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The principal name on a gss cred is used to setup the NFSv4.0 callback, which has to have a client principal name to authenticate to. That code wants the name to be in the form servicetype@hostname. rpc.svcgssd passes down such names (and passes down no principal name at all in the case the principal isn't a service principal). gss-proxy always passes down the principal name, and passes it down in the form servicetype/hostname@REALM. So we've been munging the name gss-proxy passes down into the format the NFSv4.0 callback code expects, or throwing away the name if we can't. Since the introduction of the MACH_CRED enforcement in NFSv4.1, we've also been using the principal name to verify that certain operations are done as the same principal as was used on the original EXCHANGE_ID call. For that application, the original name passed down by gss-proxy is also useful. Lack of that name in some cases was causing some kerberized NFSv4.1 mount failures in an Active Directory environment. This fix only works in the gss-proxy case. The fix for legacy rpc.svcgssd would be more involved, and rpc.svcgssd already has other problems in the AD case. Reported-and-tested-by: James Ralston <ralston@pobox.com> Acked-by: Simo Sorce <simo@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: fix unlikely NULL deref in mach_creds_match	J. Bruce Fields	2015-11-24	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We really shouldn't allow a client to be created with cl_mach_cred set unless it also has a principal name. This also allows us to fail such cases immediately on EXCHANGE_ID as opposed to waiting and incorrectly returning WRONG_CRED on the following CREATE_SESSION. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: minor consolidation of mach_cred handling code	J. Bruce Fields	2015-11-24	1	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \|	Minor cleanup, no change in functionality. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: helper for dup of possibly NULL string	J. Bruce Fields	2015-11-24	1	-6/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Technically the initialization in the NULL case isn't even needed as the only caller already has target zeroed out, but it seems safer to keep copy_cred generic. Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: fix a warning message	Dan Carpenter	2015-11-23	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The WARN() macro takes a condition and a format string. The condition was accidentally left out here so it just prints the function name instead of the message. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: constify nfsd4_callback_ops structure	Julia Lawall	2015-11-23	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nfsd4_callback_ops structure is never modified, so declare it as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
\| *	nfsd: recover: constify nfsd4_client_tracking_ops structures	Julia Lawall	2015-11-23	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nfsd4_client_tracking_ops structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* \|	Merge branch 'for_linus' of ↵	Linus Torvalds	2016-01-15	10	-200/+195
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fixes and quota cleanups from Jan Kara: "Several UDF fixes and some minor quota cleanups" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Check output buffer length when converting name to CS0 udf: Prevent buffer overrun with multi-byte characters quota: constify qtree_fmt_operations structures udf: avoid uninitialized variable use udf: Fix lost indirect extent block udf: Factor out code for creating indirect extent udf: limit the maximum number of indirect extents in a row udf: limit the maximum number of TD redirections fs: make quota/dquot.c explicitly non-modular fs: make quota/netlink.c explicitly non-modular
\| * \|	udf: Check output buffer length when converting name to CS0	Andrew Gabbasov	2016-01-04	1	-2/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If a name contains at least some characters with Unicode values exceeding single byte, the CS0 output should have 2 bytes per character. And if other input characters have single byte Unicode values, then the single input byte is converted to 2 output bytes, and the length of output becomes larger than the length of input. And if the input name is long enough, the output length may exceed the allocated buffer length. All this means that conversion from UTF8 or NLS to CS0 requires checking of output length in order to stop when it exceeds the given output buffer size. [JK: Make code return -ENAMETOOLONG instead of silently truncating the name] CC: stable@vger.kernel.org Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: Prevent buffer overrun with multi-byte characters	Andrew Gabbasov	2016-01-04	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udf_CS0toUTF8 function stops the conversion when the output buffer length reaches UDF_NAME_LEN-2, which is correct maximum name length, but, when checking, it leaves the space for a single byte only, while multi-bytes output characters can take more space, causing buffer overflow. Similar error exists in udf_CS0toNLS function, that restricts the output length to UDF_NAME_LEN, while actual maximum allowed length is UDF_NAME_LEN-2. In these cases the output can override not only the current buffer length field, causing corruption of the name buffer itself, but also following allocation structures, causing kernel crash. Adjust the output length checks in both functions to prevent buffer overruns in case of multi-bytes UTF8 or NLS characters. CC: stable@vger.kernel.org Signed-off-by: Andrew Gabbasov <andrew_gabbasov@mentor.com> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	quota: constify qtree_fmt_operations structures	Julia Lawall	2016-01-04	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The qtree_fmt_operations structures are never modified, so declare them as const. Done with the help of Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: avoid uninitialized variable use	Arnd Bergmann	2016-01-04	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A new warning has come up from a recent cleanup: fs/udf/inode.c: In function 'udf_setup_indirect_aext': fs/udf/inode.c:1927:28: warning: 'adsize' may be used uninitialized in this function [-Wmaybe-uninitialized] If the alloc_type is neither ICBTAG_FLAG_AD_SHORT nor ICBTAG_FLAG_AD_LONG, the value of adsize is undefined. Currently, callers of these functions make sure alloc_type is one of the two valid ones but for future proofing make sure we handle the case of invalid alloc type as well. This changes the code to return -EIOin that case. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: fcea62babc81 ("udf: Factor out code for creating indirect extent") Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: Fix lost indirect extent block	Jan Kara	2015-12-23	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When inode ends with empty indirect extent block and we extended that file, udf_do_extend_file() ended up just overwriting pointer to it with another extent and thus effectively leaking the block and also corruptiong length of allocation descriptors. Fix the problem by properly following into next indirect extent when it is present. Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: Factor out code for creating indirect extent	Jan Kara	2015-12-23	3	-189/+130
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Factor out code for creating indirect extent from udf_add_aext(). It was mostly duplicated in two places. Also remove some opencoded versions of udf_write_aext(). Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: limit the maximum number of indirect extents in a row	Vegard Nossum	2015-12-23	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	udf_next_aext() just follows extent pointers while extents are marked as indirect. This can loop forever for corrupted filesystem. Limit number the of indirect extents we are willing to follow in a row. [JK: Updated changelog, limit, style] Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: stable@vger.kernel.org Cc: Jan Kara <jack@suse.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: limit the maximum number of TD redirections	Vegard Nossum	2015-12-14	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Filesystem fuzzing revealed that we could get stuck in the udf_process_sequence() loop. The maximum limit was chosen arbitrarily but fixes the problem I saw. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	fs: make quota/dquot.c explicitly non-modular	Paul Gortmaker	2015-12-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Kconfig currently controlling compilation of this code is: config QUOTA bool "Quota support" ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modularity so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering gets bumped to one level earlier when we use the more appropriate fs_initcall here. However we've made similar changes before without any fallout and none is expected here either. We don't delete module.h because the code in turn tries to load other modules as appropriate and so it still needs that header. Cc: Jan Kara <jack@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	fs: make quota/netlink.c explicitly non-modular	Paul Gortmaker	2015-12-14	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Kconfig currently controlling compilation of this code is: config QUOTA_NETLINK_INTERFACE bool "Report quota messages through netlink interface" ...meaning that it currently is not being built as a module by anyone. Lets remove the couple traces of modularity so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering gets bumped to one level earlier when we use the more appropriate fs_initcall here. However we've made similar changes before without any fallout and none is expected here either. Cc: Jan Kara <jack@suse.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jan Kara <jack@suse.cz>
* \| \|	Merge branch 'akpm' (patches from Andrew)	Linus Torvalds	2016-01-15	72	-262/+252
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge first patch-bomb from Andrew Morton: - A few hotfixes which missed 4.4 becasue I was asleep. cc'ed to -stable - A few misc fixes - OCFS2 updates - Part of MM. Including pretty large changes to page-flags handling and to thp management which have been buffered up for 2-3 cycles now. I have a lot of MM material this time. [ It turns out the THP part wasn't quite ready, so that got dropped from this series - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) zsmalloc: reorganize struct size_class to pack 4 bytes hole mm/zbud.c: use list_last_entry() instead of list_tail_entry() zram/zcomp: do not zero out zcomp private pages zram: pass gfp from zcomp frontend to backend zram: try vmalloc() after kmalloc() zram/zcomp: use GFP_NOIO to allocate streams mm: add tracepoint for scanning pages drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64 mm/page_isolation: use macro to judge the alignment mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE() mm: rework virtual memory accounting include/linux/memblock.h: fix ordering of 'flags' argument in comments mm: move lru_to_page to mm_inline.h Documentation/filesystems: describe the shared memory usage/accounting memory-hotplug: don't BUG() in register_memory_resource() hugetlb: make mm and fs code explicitly non-modular mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd() mm: make sure isolate_lru_page() is never called for tail page vmstat: make vmstat_updater deferrable again and shut down on idle ...
\| * \| \|	mm: rework virtual memory accounting	Konstantin Khlebnikov	2016-01-15	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which testing the RLIMIT_DATA value to figure out if we're allowed to assign new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been commited that RLIMIT_DATA in a form it's implemented now doesn't do anything useful because most of user-space libraries use mmap() syscall for dynamic memory allocations. Linus suggested to convert RLIMIT_DATA rlimit into something suitable for anonymous memory accounting. But in this patch we go further, and the changes are bundled together as: * keep vma counting if CONFIG_PROC_FS=n, will be used for limits * replace mm->shared_vm with better defined mm->data_vm * account anonymous executable areas as executable * account file-backed growsdown/up areas as stack * drop struct file* argument from vm_stat_account * enforce RLIMIT_DATA for size of data areas This way code looks cleaner: now code/stack/data classification depends only on vm_flags state: VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc) VM_GROWSUP \| VM_GROWSDOWN -> stack (VmStk) VM_WRITE & ~VM_SHARED & !stack -> data (VmData) The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called "shared", but that might be strange beast like readonly-private or VM_IO area. - RLIMIT_AS limits whole address space "VmSize" - RLIMIT_STACK limits stack "VmStk" (but each vma individually) - RLIMIT_DATA now limits "VmData" Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| * \| \|	hugetlb: make mm and fs code explicitly non-modular	Paul Gortmaker	2016-01-15	1	-25/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The Kconfig currently controlling compilation of this code is: config HUGETLBFS bool "HugeTLB file system support" ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering gets moved to earlier levels when we use the more appropriate initcalls here. Originally I had the fs part and the mm part as separate commits, just by happenstance of the nature of how I detected these non-modular use cases. But that can possibly introduce regressions if the patch merge ordering puts the fs part 1st -- as the 0-day testing reported a splat at mount time. Investigating with "initcall_debug" showed that the delta was init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So both the fs change and the mm change are here together. In addition, it worked before due to luck of link order, since they were both in the same initcall category. So we now have the fs part using fs_initcall, and the mm part using subsys_initcall, which puts it one bucket earlier. It now passes the basic sanity test that failed in earlier 0-day testing. We delete the MODULE_LICENSE tag and capture that information at the top of the file alongside author comments, etc. We don't replace module.h with init.h since the file already has that. Also note that MODULE_ALIAS is a no-op for non-modular code. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Reported-by: kernel test robot <ying.huang@linux.intel.com> Cc: Nadia Yvette Chambers <nyc@holomorphy.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| * \| \|	mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in ↵	Oleg Nesterov	2016-01-15	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	clear_soft_dirty_pmd() clear_soft_dirty_pmd() is called by clear_refs_write(CLEAR_REFS_SOFT_DIRTY), VM_SOFTDIRTY was already cleared before walk_page_range(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| * \| \|	proc: meminfo: estimate available memory more conservatively	Johannes Weiner	2016-01-15	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The MemAvailable item in /proc/meminfo is to give users a hint of how much memory is allocatable without causing swapping, so it excludes the zones' low watermarks as unavailable to userspace. However, for a userspace allocation, kswapd will actually reclaim until the free pages hit a combination of the high watermark and the page allocator's lowmem protection that keeps a certain amount of DMA and DMA32 memory from userspace as well. Subtract the full amount we know to be unavailable to userspace from the number of free pages when calculating MemAvailable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
\| * \| \|	fs/block_dev.c:bdev_write_page(): use blk_queue_enter(..., GFP_NOIO)	Andrew Morton	2016-01-15	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	bdev_write_page() is used by swapout and by writepage where we cannot use __GFP_FS or __GFP_IO. So it is misleading to mention GFP_KERNEL here. blk_queue_enter() only actually looks at __GFP_DIRECT_RECLAIM, so no bugs were harmed in the making of this patch. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jens Axboe <axboe@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>